Jan 10, 2016
In this post, I'm going to try and see if I can elicit any relationships between the explanatory features and the explained features (expenditure).
This part of the project serves to aid in getting to actually know the data, and in hypothesis making, hypothesis refinement, and model selection; it essentially lays the foundation for the "story" that the project will tell.
I find this stage of the modeling process to be one of the most interesting. It's hard to know if and how your priors about the problem you're looking at will influence the modeling process - I think they almost certainly do. At the same time, there's a great deal of dramatic tension - will my hunch about the problem be supported by the data or will the data show me that it's completely false? Will another insight rise to take it's place? Will there be any story at all? All of these outcomes are entirely possible at this stage, and it's quite nerve wracking.
In any case, I can do my best by examining the data for as much supporting (or refuting) evidence in as objective a manner as possible.
Here are some of the questions I'm hoping to answer at the end of this post:
- What explanatory features are predictive (if any)?
- What explains the statistical variation in the dataset? What does the randomness look like? Is there a pattern? Surely it's not all just white noise!
- Which features are mutually correlated? Do features combine to create stronger or weaker relationships?
Next, we've accumulated a lot of boilerplate code that I need to load to start up my modeling session at this point. Rather than subject you to a lot of scrolling, I'm simply going to save it to a python module called blogging.py, and import it in one step. If you'd like to take a look at the environment I'm importing to build the analysis in this post, it's here.
The next step in my analysis is to explore the data a bit more. I want to see if I can get any more insight into what the underlying data generating process is; the more I can rely on a clean look at the data, the better I'll be able to dispel my own incorrect assumptions about what is or should be happening.
One problem I do have to contend with in working with this dataset is that it comes as weighted survey data. Working with weighted data can be quite tricky. In the very first instance, since the unweighted data does not (by definition) accurately represent the population of interest, it's impossible to draw valid inferences from the standard exploratory (or even regression) procedure without accounting for the weights.
When constructing summary data, as in a histogram, or constructing a linear function of the data, this is not much of a problem. Any linear function $f(x)$ of the data is just as easy to compute as $f(Wx)$, where $W$ is a weight matrix, and variances will just be quadratic in $W$ as expected. However, there are myriad other challenges in working with weighted data; there's a relatively recent discussion of some of these challenges in this paper by Andrew Gelman.
For my immediate purposes - my Insight project - the challenge was to come up with some rough and ready ways to accommodate these weights. While the histogram methods in the Python ecosystem have options for specifying item level weights, not all EDA function calls do. In particular, scatterplots don't (as far as I know). Most model estimation calls that I looked into before starting this project did have some accommodations for item weights, so I figured that if I could get past the EDA stage, I might be fine.
I needed to find a way of getting a sense of the pairwise correlations in the dataset while accounting for the weights. In the end, I found that matplotlib hexbin plots had some support for item weights, and I figured that adding the weighted sum of points to hexbin cells would do well enough without my having to write my own visualization functions. In the end, this is what I constructed:
Let's try this out on a an example pair of features.
In this graph, greater (weighted) mass in any one hexbin results in a higher color count (according to the colorbar scale on the right). More tightly packed data results in a greater plot density.
There looks to be a general tendency for older people to spend more on office based services in this picture, but my sense of the trend is obscured by the outliers above 100k. Of these outliers, most are in the 70-80 year range, with one below 20 years. There's not enough data for me to make a robust case about whether age correlates well with the incidence of really high expenditure, so I'm going to try looking at this again after cutting off the outliers.
This is helpful. It looks like a lot of weighted mass (the aqua region at the bottom) is concentrated near zero expenditure. However, it highlights the fact that expenditure trends higher with age (as does variance in expenditure).
It looks like this is a fairly useful visualization technique as far as quick and dirty goes.
Here are the other groups of correlations I want to inspect:
- Income
- Utilization
- Health conditions
- Duration without insurance
- Age of diagnosis
- BMI
- Demographic information
- Expenditures.
I also want to look at the four expenditure categories at the same time. Here's a quick method to help me do that:
It looks like the same pattern holds for the other expenditure categories as for office expenditure when it comes to age.
Looking at income:
Here you get the result that while the level of spending moves away from zero as income rises, the variance of expenditure in all four categories declines with income. An important question is whether the reduced variation in spending at higher incomes is simply a result of having fewer people at those income levels, or if there is a systematic effect (more effective health management, better bargaining power).
Looking at how utilization correlates with expenditure:
I get the following out of these plots:
- Office based utilization and expenditure are positively correlated. Office based utilization is negatively correlated with all other expenditure.
- Outpatient utilization doesn't strongly correlate with outpatient expenditure. Outpatient utilization correlates somewhat negatively with other categories of expenditure.
- ER visits correlate negatively with other expenditures, and not particularly at all with ER expenditure.
- Inpatient visits correspond positively with inpatient utilization, and negatively with all other expenditure.
Next, I'll look at chronic health diagnoses. There are over 40 in my list, so it'll be useful to break them into groups.
Before we take a look at these plots, I want to quickly mention a very counterintuitive finding that I came across when I looked at these. For many of these chronic conditions, the subpopulation that does not have a positive diagnosis has a much heavier right tail than the subpopulation with the chronic condition.
On the face of it, it looked like people overall were at greater risk of spending more if they did not have a chronic illness than if they did. The paradox disappears when you zoom in on these plots around the first $\$10,000$ spent; it shows that many more people without a chronic condition are clustered at $\$0$ spent, relatively speaking. It's interesting to conjecture what may cause the thinner tails for chronic sufferers - one explanation may just be that the larger subpopulation is naturally accompanied by a larger variance.
The plots below shows the scatterplots to this limited scale for emphasis.
Conclusions:
- A cancer diagnosis overall drives expenditure for office based services and inpatient services, and has less of a clear effect on outpatient spending and ER spending. For these latter two categories, there is greater variance in spending for the subpopulation without a positive cancer diagnosis.
- There is very little mass in the graphs measuring spending against the various subdiagnoses of cancer. It seems likely that there will be a corresponding lack of statistical significance of these features in any analysis I do, and it may be well to leave those out and just use the overall cancer diagnosis feature.
Conclusions:
- Diagnoses of emphysema, bronchitis, and asthma drive spending on office services and inpatient services. They have less of an effect on outpatient and ER services. The right tails in these latter categories is heavier in the negative group.
- The other features do not seem to be as predictive of increased spending.
Conclusions:
- Diagnoses of high blood pressure, high cholesterol, coronary heart disease, angina, heart attack, other heart disease, stroke, and diabetes drive spending on office services and inpatient services, while having a limited effect on the other two categories of spending, in which they are associated with thinner right tails.
- A multiple diagnosis of high blood pressure is probably not very predictive.
Conclusions:
- Diagnoses of joint pain and arthritis drive spending on office services and inpatient services, while having a limited effect on the other two categories of spending, in which they are associated with thinner right tails.
Conclusions:
- Diagnoses of ADD/ADHD, limitations in physical functioning, and incidence of pregnancy drive spending on office services and inpatient services, while having a limited effect on the other two categories of spending, in which they are associated with thinner right tails.
Conclusion:
- It doesn't look like spending is correlated with time elapsed without health insurance.
Conclusion:
- There doesn't seem to be a strong correlation between the age of diagnosis for any of these chronic conditions and annual spending on any category of healthcare.
Conclusions:
- No features except insurance status generate significant differences in outpatient expenditure.
- ER spending is affected by race, marital status, employment status, self-employment status, holding multiple jobs, union status, whether you are offered health insurance by your employer, your overall insurance status (notably whether you have private insurance or a public option), whether your household uses an FSA, where you live (notably, there is more zero clustering for ER spending in the South, and less in the Northeast), and educational attainment.
- Both office and inpatient spending are affected by race, marital status, employment status, having more than one job, being self employed, union status, whether health insurance is offered through employment, whether public or private options for health insurance are used, whether FSA's are used, insurance status, census region, and educational attainment.
Finally, a look at whether body mass index affects spending:
From looking at this, there doesn't seem to be a strong correlation between spending and BMI in any category. I find this somewhat counterintuitive, since obesity is supposed to be correlated with poor cardiovascular health. I'm inclined to leave this feature in to see if there are any interesting interaction effects that may turn up in the modeling process.
At this point, I've been able to look at correlations between the explanatory features and spending.
Something else I'd like to do before I move on to making modeling decisions is to examine the pairwise correlations between the explanatory features. Apart from giving me a better view of the dataset and patterns within, I find it useful to get a sense of which features provide "orthogonal" signal. This is generally useful in building models that are based on least squares optimization, since parameter estimates tend to exhibit poor precision in the presence of mutually correlated regressors. While this can often be mitigated by regularizing the objective function (for example, penalizing the objective function by a scalar multiple of the $L_1$ norm of the parameter vector will generally send "redundant" parameter estimates to zero), it's nice to know why this might be happening behind the scenes.
A problem to take care of before hand is that almost none of the categorical features are coded in a way where it makes sense to compute a raw correlation coefficient. For example, the levels here:
Out[19]:
array([ 2, -1, 1, -7, -9, -8])
don't correlate linearly with the severity of the high cholesterol condition.
To more accurately get a sense of the correlations between the properties measured by these categorical features, I'm going to create new dummy variables (that take values either 0 or 1) that correspond to these properties.
I'll do this by groups of predictors:
This should allow me to get a decent sense of the correlations within the explanatory variables (both within these groups and overall).
Here's a heatmap of the correlations between the health status variables.
Conclusions:
- The arthritis, joint pain, diabetes, limitations in physical functioning and heart health features are mutually correlated. Presumably this arises from comorbidities in advanced age.
- The limitations in physical functioning feature is correlated with all chronic conditions ranging from cancer diagnosis to high cholesterol, bronchitis, high blood pressure and coronary heart disease.
- High cholesterol is strongly correlated to the incidence of high blood pressure.
While there are more interesting facts to be gleaned from this heatmap, one thing that strikes me is that the limitations in physical functioning feature is correlated with most chronic conditions. This may be useful to include in a model as a controlling feature in order to reduce omitted variables bias and best tease out the individual contributions to spending of the distinct chronic conditions. As a quick check to see if this might be a problem, let's see if the limitations in physical functioning feature correlates with spending in even one category:
While there's much less mass in the positive bucket, there seems to be much more zero clustering in the negative bucket. For the moment, I might suspect a possibility of introducing omitted variables bias if I were to leave out the limitations in physical functioning feature.
Next, here's how some of the demographic information correlates:
- There's a natural block of negative correlation between the various race categories - most people self-identify with only one of these, so any self identification with one category reduces the level of self-identification with the others.
- Being married seems to correlate with education level. Possibly this is just a consequence of greater age.
Let's look at the employment variables:
Conclusions:
- The occupation sectors most likely to offer health insurance seem to be education/health/social services and manufacturing.
- People whose employers offer health insurance are most likely to be insured for the whole year.
- People who have private insurance are likely to have it for the whole year.
Next, a look at what we get if we toss in employment, demographic and health variables into one heatmap:
This plot is quite rich in detail. Some conclusions:
- Higher educational attainment correlates with employers offering health insurance.
- High blood pressure, arthritis and heart disease correlate with being on public insurance. Age is likely the intermediating factor.
- Professional, health/social/education, management/business/financial and manufacturing occupations tend to offer employer insurance.
- Higher educational attainment correlates with likelihood of being employed.
Finally, a quick look at correlations between diagnosis ages for chronic conditions:
Conclusions:
- The age of diagnosis for each pair of (high cholesterol, high blood pressure), (high cholesterol, arthritis) and (heart attack, coronoary heart disease) are positively correlated.
To conclude this part, I've learned something about the structure of the explanatory features, and what it will mean for the design matrix in my model. Also, I think heatmaps are a really nice visualization tool.
At this point I have a better view of the data, and a sense of what features I can expect to be predictive. This is a decent foundation for a pretty rich modeling exercise.
In the next post, I'll write about the model selection process.
Click to read and post comments
Dec 24, 2015
In this post, I'm going to look at the explanatory features in the dataset.
Let's get our dataset, variables and functions from before.
AGE_AS_OF_12/31/13_(EDITED/IMPUTED)
SEX
RACE/ETHNICITY_(EDITED/IMPUTED)
MARITAL_STATUS-12/31/13_(EDITED/IMPUTED)
FAMILY'S_TOTAL_INCOME
ADULT_BODY_MASS_INDEX_(>17)_-_RD_5/3
HIGH_BLOOD_PRESSURE_DIAG_(>17)
CORONARY_HRT_DISEASE_DIAG_(>17)
AGE_OF_DIAGNOSIS-HIGH_BLOOD_PRESSURE
MULT_DIAG_HIGH_BLOOD_PRESS_(>17)
AGE_OF_DIAGNOSIS-CORONARY_HEART_DISEASE
ANGINA_DIAGNOSIS_(>17)
AGE_OF_DIAGNOSIS-ANGINA
HEART_ATTACK_(MI)_DIAG_(>17)
AGE_OF_DIAGNOSIS-HEART_ATTACK(MI)
OTHER_HEART_DISEASE_DIAG_(>17)
AGE_OF_DIAGNOSIS-OTHER_HEART_DISEASE
STROKE_DIAGNOSIS_(>17)
AGE_OF_DIAGNOSIS-STROKE
EMPHYSEMA_DIAGNOSIS_(>17)
AGE_OF_DIAGNOSIS-EMPHYSEMA
CHRONC_BRONCHITS_LAST_12_MTHS_(>17)-R3/1
CHRONC_BRONCHITS_LAST_12_MTHS_(>17)-R5/3
HIGH_CHOLESTEROL_DIAGNOSIS_(>17)
AGE_OF_DIAGNOSIS-HIGH_CHOLESTEROL
CANCER_DIAGNOSIS_(>17)
CANCER_DIAGNOSED_-_BLADDER_(>17)
CANCER_DIAGNOSED_-_BREAST_(>17)
CANCER_DIAGNOSED_-_CERVIX_(>17)
CANCER_DIAGNOSED_-_COLON_(>17)
CANCER_DIAGNOSED_-_LUNG_(>17)
CANCER_DIAGNOSED_-_LYMPHOMA_(>17)
CANCER_DIAGNOSED_-_MELANOMA_(>17)
CANCER_DIAGNOSED_-_OTHER_(>17)
CANCER_DIAGNOSED_-_PROSTATE_(>17)
CANCER_DIAGNOSED_-_SKIN-NONMELANO_(>17)
CANCER_DIAGNOSED-SKIN-UNKNOWN_TYPE_(>17)
CANCER_DIAGNOSED_-_UTERUS_(>17)
DIABETES_DIAGNOSIS_(>17)
AGE_OF_DIAGNOSIS-DIABETES
JOINT_PAIN_LAST_12_MONTHS_(>17)_-_RD_3/1
JOINT_PAIN_LAST_12_MONTHS_(>17)_-_RD_5/3
ARTHRITIS_DIAGNOSIS_(>17)
AGE_OF_DIAGNOSIS-ARTHRITIS
ASTHMA_DIAGNOSIS
AGE_OF_DIAGNOSIS-ASTHMA
DOES_PERSON_STILL_HAVE_ASTHMA-RD3/1
DOES_PERSON_STILL_HAVE_ASTHMA_-_RD_5/3
ASTHMA_ATTACK_LAST_12_MOS_-_RD3/1
USED_ACUTE_PRES_INHALER_LAST_3_MOS-RD5/3
USED>3ACUTE_CN_PRES_INH_LAST_3_MOS-RD5/3
EVER_USED_PREV_DAILY_ASTHMA_MEDS_-RD_5/3
NOW_TAKE_PREV_DAILY_ASTHMA_MEDS_-_RD_5/3
HAVE_PEAK_FLOW_METER_AT_HOME_-_RD_5/3
EVER_USED_PEAK_FLOW_METER_-_RD_5/3
ADHDADD_DIAGNOSIS_(5-17)
AGE_OF_DIAGNOSIS-ADHD/ADD
PREGNANT_DURING_REF_PERIOD_-_RD_3/1
LIMITATION_IN_PHYSICAL_FUNCTIONING-RD3/1
EMPLOYMENT_STATUS_RD_3/1
HAS_MORE_THAN_ONE_JOB_RD_3/1_INT_DATE
SELF-EMPLOYED_AT_RD_3/1_CMJ
CHOICE_OF_HEALTH_PLANS_AT_RD_3/1_CMJ
INDUSTRY_GROUP_RD_3/1_CMJ
UNION_STATUS_AT_RD_3/1_CMJ
OCCUPATION_GROUP_RD_3/1_CMJ
HEALTH_INSUR_HELD_FROM_RD_3/1_CMJ_(ED)
HEALTH_INSUR_OFFERED_BY_RD_3/1_CMJ_(ED)
EMPLOYER_OFFERS_HEALTH_INS_RD_3/1_CMJ
FULL_YEAR_INSURANCE_COVERAGE_STATUS_2013
ANYONE_IN_RU_HAVE_FSA_-_R3/1
#_WKS/MON_WOUT_HLTH_INS_PRV_YR-PN_18_ONL
PRIVATE_INSURANCE_ANY_TIME_IN_R5/R3
PUBLIC_INS_ANY_TIME_IN_R5/R3
INSURED_ANY_TIME_IN_R3/1
ANY_TIME_COVERAGE_BY_STATE_INS_-_R3/1
#_OFFICE-BASED_PROVIDER_VISITS_13
#_OUTPATIENT_DEPT_PROVIDER_VISITS_13
#_EMERGENCY_ROOM_VISITS_13
#_NIGHTS_IN_HOSP_FOR_DISCHARGES_2013
FINAL_PERSON_WEIGHT_2013
TOTAL_OFFICE-BASED_EXP_13
TOTAL_OUTPATIENT_PROVIDER_EXP_13
TOT_HOSP_IP_FACILITY_+_DR_EXP_13
TOTAL_ER_FACILITY_+_DR_EXP_13
CENSUS_REGION_AS_OF_12/31/13
EDUCATION_RECODE_(EDITED)
Next I'm going to prototype a function that will spit out the empirical distributions of each of the explanatory variables.
I'm going to have to treat categorical variables differently to continuous variables.
One rough and ready way to break out the categorical variables is to check how many levels that feature has in the dataset - if there are relatively few (say below a threshold of 10 or 20), I'll assume the feature is categorical, and otherwise I'll assume it's continuous. To deal with some of the edge cases, it's easy to note that anything involving an age, or a # of visits should be continuous.
Let's see if this works.
Let's see if this gave us what we wanted.
Continuous features
AGE_AS_OF_12/31/13_(EDITED/IMPUTED)
FAMILY'S_TOTAL_INCOME
ADULT_BODY_MASS_INDEX_(>17)_-_RD_5/3
AGE_OF_DIAGNOSIS-HIGH_BLOOD_PRESSURE
AGE_OF_DIAGNOSIS-CORONARY_HEART_DISEASE
AGE_OF_DIAGNOSIS-ANGINA
AGE_OF_DIAGNOSIS-HEART_ATTACK(MI)
AGE_OF_DIAGNOSIS-OTHER_HEART_DISEASE
AGE_OF_DIAGNOSIS-STROKE
AGE_OF_DIAGNOSIS-EMPHYSEMA
AGE_OF_DIAGNOSIS-HIGH_CHOLESTEROL
AGE_OF_DIAGNOSIS-DIABETES
AGE_OF_DIAGNOSIS-ARTHRITIS
AGE_OF_DIAGNOSIS-ASTHMA
AGE_OF_DIAGNOSIS-ADHD/ADD
#_WKS/MON_WOUT_HLTH_INS_PRV_YR-PN_18_ONL
#_OFFICE-BASED_PROVIDER_VISITS_13
#_OUTPATIENT_DEPT_PROVIDER_VISITS_13
#_EMERGENCY_ROOM_VISITS_13
#_NIGHTS_IN_HOSP_FOR_DISCHARGES_2013
FINAL_PERSON_WEIGHT_2013
TOTAL_OFFICE-BASED_EXP_13
TOTAL_OUTPATIENT_PROVIDER_EXP_13
TOT_HOSP_IP_FACILITY_+_DR_EXP_13
TOTAL_ER_FACILITY_+_DR_EXP_13
Categorical features
SEX
RACE/ETHNICITY_(EDITED/IMPUTED)
MARITAL_STATUS-12/31/13_(EDITED/IMPUTED)
HIGH_BLOOD_PRESSURE_DIAG_(>17)
CORONARY_HRT_DISEASE_DIAG_(>17)
MULT_DIAG_HIGH_BLOOD_PRESS_(>17)
ANGINA_DIAGNOSIS_(>17)
HEART_ATTACK_(MI)_DIAG_(>17)
OTHER_HEART_DISEASE_DIAG_(>17)
STROKE_DIAGNOSIS_(>17)
EMPHYSEMA_DIAGNOSIS_(>17)
CHRONC_BRONCHITS_LAST_12_MTHS_(>17)-R3/1
CHRONC_BRONCHITS_LAST_12_MTHS_(>17)-R5/3
HIGH_CHOLESTEROL_DIAGNOSIS_(>17)
CANCER_DIAGNOSIS_(>17)
CANCER_DIAGNOSED_-_BLADDER_(>17)
CANCER_DIAGNOSED_-_BREAST_(>17)
CANCER_DIAGNOSED_-_CERVIX_(>17)
CANCER_DIAGNOSED_-_COLON_(>17)
CANCER_DIAGNOSED_-_LUNG_(>17)
CANCER_DIAGNOSED_-_LYMPHOMA_(>17)
CANCER_DIAGNOSED_-_MELANOMA_(>17)
CANCER_DIAGNOSED_-_OTHER_(>17)
CANCER_DIAGNOSED_-_PROSTATE_(>17)
CANCER_DIAGNOSED_-_SKIN-NONMELANO_(>17)
CANCER_DIAGNOSED-SKIN-UNKNOWN_TYPE_(>17)
CANCER_DIAGNOSED_-_UTERUS_(>17)
DIABETES_DIAGNOSIS_(>17)
JOINT_PAIN_LAST_12_MONTHS_(>17)_-_RD_3/1
JOINT_PAIN_LAST_12_MONTHS_(>17)_-_RD_5/3
ARTHRITIS_DIAGNOSIS_(>17)
ASTHMA_DIAGNOSIS
DOES_PERSON_STILL_HAVE_ASTHMA-RD3/1
DOES_PERSON_STILL_HAVE_ASTHMA_-_RD_5/3
ASTHMA_ATTACK_LAST_12_MOS_-_RD3/1
USED_ACUTE_PRES_INHALER_LAST_3_MOS-RD5/3
USED>3ACUTE_CN_PRES_INH_LAST_3_MOS-RD5/3
EVER_USED_PREV_DAILY_ASTHMA_MEDS_-RD_5/3
NOW_TAKE_PREV_DAILY_ASTHMA_MEDS_-_RD_5/3
HAVE_PEAK_FLOW_METER_AT_HOME_-_RD_5/3
EVER_USED_PEAK_FLOW_METER_-_RD_5/3
ADHDADD_DIAGNOSIS_(5-17)
PREGNANT_DURING_REF_PERIOD_-_RD_3/1
LIMITATION_IN_PHYSICAL_FUNCTIONING-RD3/1
EMPLOYMENT_STATUS_RD_3/1
HAS_MORE_THAN_ONE_JOB_RD_3/1_INT_DATE
SELF-EMPLOYED_AT_RD_3/1_CMJ
CHOICE_OF_HEALTH_PLANS_AT_RD_3/1_CMJ
INDUSTRY_GROUP_RD_3/1_CMJ
UNION_STATUS_AT_RD_3/1_CMJ
OCCUPATION_GROUP_RD_3/1_CMJ
HEALTH_INSUR_HELD_FROM_RD_3/1_CMJ_(ED)
HEALTH_INSUR_OFFERED_BY_RD_3/1_CMJ_(ED)
EMPLOYER_OFFERS_HEALTH_INS_RD_3/1_CMJ
FULL_YEAR_INSURANCE_COVERAGE_STATUS_2013
ANYONE_IN_RU_HAVE_FSA_-_R3/1
PRIVATE_INSURANCE_ANY_TIME_IN_R5/R3
PUBLIC_INS_ANY_TIME_IN_R5/R3
INSURED_ANY_TIME_IN_R3/1
ANY_TIME_COVERAGE_BY_STATE_INS_-_R3/1
CENSUS_REGION_AS_OF_12/31/13
EDUCATION_RECODE_(EDITED)
Pretty good!
Next I'll prototype a function to produce bar plots for the categorical variables. Since the data points are weighted, I need to make sure I scale each point out correctly before I aggregate under each label as well.
Here's the function:
The categorical features above split up nicely into two groups that are interesting. The first consists of health status features, and the second consists of demographic and socioeconomic features. I'll group them accordingly:
Here are a few observations about these health status features:
- High blood pressure, diabetes, high cholesterol, joint pain and arthritis are relatively common in the population.
- Heart disease (angina, heart attack, etc) and stroke are less prevalent, but still substantial.
- Ditto for lung disease (emphysema, bronchitis, asthma).
- While a cancer diagnosis is relatively common, the individual cancer diagnoses have very low mass in the "yes" bucket. This will likely pose a problem for estimating models, unless we either upsample, or otherwise penalize the other responses. Including these features in our model may be informative if different cancers affect expenditure in different ways, but we'll have to balance this against the lack of reliability of parameter estimates that we'll get from these features. Unless we have a good reason to believe that different cancers cost very different amounts to treat, we may be best off using the aggregate cancer diagnosis feature, and dropping the individual diagnosis features.
- Practically no-one uses a peak flow meter.
- There's relatively incidence of ADD/ADHD in the population.
- Relatively few people are pregnant at the time of the survey.
- General limitation in physical functioning is fairly prevalent.
Next let's look at the demographic and socio-economic features.
About the demographic/socio-economic features:
- The male/female subpopulations are evenly balanced in the population, with slightly more women than men.
- The race/ethnicity distribution is consistent with other surveys.
- The distribution estimates that about 70m poeple are under 16, a similar number has never been married, about 130m people are married.
- About 150m people were employed when the survey was conducted. About 100m people are not employed (not in the labor force, or unemployed). This is consistent with reported data elsewhere. About 20m people are self employed, and about 10m people hold more than one job.
- Remarkably few people - less than 50m - have a choice of health plans at work. I'm interested in whether people who have a choice of health plans tend to pay less for healthcare on average, assuming there is any effect coming from encouraging people to think about their purchasing choices on subsequent utilization or on prices paid/negotiated.
- The modal entry for industry group and occupational group (by far) is inapplicable. This is a constructed feature in the dataset, so I assume this wasn't directly asked of respondents. It's hard to say what this might mean, or if it'll be useful to keep in our model.
- About 10m people are in a union. It would also be interesting to know whether being part of a union affects utilization or expenditure in any way.
- About 60m people did not hold health insurance at the time. These numbers are consistent with pre-ACA uninsured levels.
- About 80m people were offered health insurance through their employers, and 40m were not.
- About 180m people under 65 had private insurance coverage during the survey, about 50m people under 65 had only public insurance options, and about 50m people under 65 were uninsured. There were about 20m people over 65 holding a mix of private insurance and medicare, and a similar number with only medicare coverage.
- About 30m people lived in a household where someone had a flexible spending account. This is about $10\%$ of the population, but FSAs (especially coupled with high-deductible plans) are an interesting demand-side innovation and it would be interesting to understand if this plays a role in determining expenditure.
- A negligible number of people (relatively speaking) had coverage under a state insurance plan.
- The distribution of people over census regions is approximately as follows: about 120m in the south, about 75m people in the west, about 65m people in the midwest, and about 60m people in the northeast.
- The distibution over educational attainment is as follows: about 70m people have an associate's degree past high school, about 65m have a GED or high school diploma, a little under 60m have an 8th grade qualification or less, about 30m people stopped high school between 9th and 12th grade and do not have a high school diploma, a little over 40m people have a 4-year degree, about 25m people have a masters, doctoral or professional degree, and about 25m people fall into the inapplicable bucket.
Now that we've had a look at the categorical features, let's take a look at the continuous features.
Out[12]:
['AGE_AS_OF_12/31/13_(EDITED/IMPUTED)',
"FAMILY'S_TOTAL_INCOME",
'ADULT_BODY_MASS_INDEX_(>17)_-_RD_5/3',
'AGE_OF_DIAGNOSIS-HIGH_BLOOD_PRESSURE',
'AGE_OF_DIAGNOSIS-CORONARY_HEART_DISEASE',
'AGE_OF_DIAGNOSIS-ANGINA',
'AGE_OF_DIAGNOSIS-HEART_ATTACK(MI)',
'AGE_OF_DIAGNOSIS-OTHER_HEART_DISEASE',
'AGE_OF_DIAGNOSIS-STROKE',
'AGE_OF_DIAGNOSIS-EMPHYSEMA',
'AGE_OF_DIAGNOSIS-HIGH_CHOLESTEROL',
'AGE_OF_DIAGNOSIS-DIABETES',
'AGE_OF_DIAGNOSIS-ARTHRITIS',
'AGE_OF_DIAGNOSIS-ASTHMA',
'AGE_OF_DIAGNOSIS-ADHD/ADD',
'#_WKS/MON_WOUT_HLTH_INS_PRV_YR-PN_18_ONL',
'#_OFFICE-BASED_PROVIDER_VISITS_13',
'#_OUTPATIENT_DEPT_PROVIDER_VISITS_13',
'#_EMERGENCY_ROOM_VISITS_13',
'#_NIGHTS_IN_HOSP_FOR_DISCHARGES_2013',
'FINAL_PERSON_WEIGHT_2013',
'TOTAL_OFFICE-BASED_EXP_13',
'TOTAL_OUTPATIENT_PROVIDER_EXP_13',
'TOT_HOSP_IP_FACILITY_+_DR_EXP_13',
'TOTAL_ER_FACILITY_+_DR_EXP_13']
There are broadly speaking, continuous variables in the following categories: ages, income levels, time since some event, number of occurences of an event, and total expenditures.
Let's group these variables:
Let's look at the ages first.
The documentation only mentions that the value $-1$ is recorded for age when there is no valid data collected and no imputation can reasonably be made. In the following graph I'll censor this level, since there's not really a good way to interpret this that I can think of. I will probably also leave it out when building the model.
The first thing to notice is the huge mass at 85. There aren't actually over 6m people in the country of age exactly 85; the data is top-coded at this value to preserve respondent privacy. This means I won't be able to build a model that discriminates on age past 85, and I'll just have to lump everyone older than that into the same bucket. I'll hopefully be able to get a reasonable cross section of that subpopulation using other features.
Next let's look at the age of diagnosis features. Since most people don't have chronic illnesses of these types, the mass at -1 dominates the distribution. Accordingly, I'm going to censor those points in these plots, so each of these distributions are conditional on having that chronic condition.
Here's a modified function from the last post for producing histograms with weighted data.
- The "lifestyle" diseases: heart attack, angina, high blood pressure, coronary heart disease, stroke, emphysema, high cholesterol, and diabetes all follow very similar distributions, which are all not-quite normal, with modes in the 50-60 range, and with some right skew.
- It will be interesting to look at correlations between these lifestyle disease features. I don't know a priori if this should be high or low, but I would imagine that having multiple comorbidities would be predictive of higher expenditure on healthcare.
- Arthritis follows a similar distribution to the lifestyle diseases.
- Asthma is very heavily skewed left, and is most diagnosed at very young ages, with the diagnosis rate trailing off as the population gets older.
- The age of diagnosis for ADD/ADHD is very tightly clustered in the 0-20 range, with a bell shaped distribution, non-existent tails, and some skew to the left.
- A generally interesting hypothesis about the relationship between expenditure and the age of diagnosis of any of these diseases may be that the earlier the diagnosis, the better the patient is at making lifestyle changes that can mitigate expenditure levels. It might be interesting to see if this holds up in the modeling.
The BMI (body mass index) is computed by dividing a person's weight in pounds by their squared height in square inches and multiplying by 703.
Here's how to interpret the adult body mass index according to the CDC guidelines for interpretation.
The distribution of BMI in the population is estimated by this histogram to be bell-shaped, with a heavier right tail, and centered around 25 or so.
Next, on to the income variable:
count 36940.000000
mean 58199.456984
std 56276.325042
min -258220.000000
25% 20000.000000
50% 41207.000000
75% 79400.000000
max 543051.000000
Name: FAMILY'S_TOTAL_INCOME, dtype: float64
What's really weird about this is that I see responses below 0; far below 0!
Let's take a closer look.
Out[19]:
|
FAMILY'S_TOTAL_INCOME |
TOTAL_OFFICE-BASED_EXP_13 |
TOTAL_OUTPATIENT_PROVIDER_EXP_13 |
TOT_HOSP_IP_FACILITY_+_DR_EXP_13 |
TOTAL_ER_FACILITY_+_DR_EXP_13 |
| 54 |
-258220 |
1467 |
9 |
0 |
534 |
| 55 |
-258220 |
0 |
0 |
0 |
0 |
| 4843 |
-3005 |
0 |
0 |
0 |
0 |
| 4844 |
-3005 |
37 |
0 |
0 |
622 |
| 8200 |
-1500 |
2815 |
0 |
0 |
1146 |
| 27520 |
-2196 |
0 |
0 |
0 |
0 |
Thankfully there are not many data points here. An important question is how to interpret an income of $-258220$. The smaller values for negative incomes are also interesting - are these reported after something like adjusting for public assistance by the respondents?
Let's take a look at the survey documentation.
FAMINC13 contains total family income for each person’s CPS family. Family income was derived by constructing person-level total income comprising annual earnings from wages, salaries, bonuses, tips, commissions; business and farm gains and losses; unemployment and workers’ compensation; interest and dividends; alimony, child support, and other private cash transfers; private pensions, IRA withdrawals, social security, and veterans payments; supplemental security income and cash welfare payments from public assistance, Temporary Assistance for Needy Families, and related programs; gains or losses from estates, trusts, partnerships, S corporations, rent, and royalties; and a small amount of “other” income. Person-level income excluded tax refunds and capital gains. Person-level income totals were then summed over family members, as defined by CPSFAMID, to yield CPS family-level total income (FAMINC13).
Looking at the expenditure features, there's nothing particularly outlandish about these records apart from the incomes. Given this, I think I'm inclined to go by: "when in doubt, throw it out" when it comes to these data points.
Let's truncate at 0 and take a look at the shape of the distribution.
The distribution is unimodal (mode around $\$50,000$), and with most of the mass concentrated between $\$0$ and $\$200,000$, and a thin (but very well off!) tail past that point.
Let's move on to the durations variables.
Out[21]:
|
#_WKS/MON_WOUT_HLTH_INS_PRV_YR-PN_18_ONL |
| count |
36940.000000 |
| mean |
-0.899729 |
| std |
1.303947 |
| min |
-9.000000 |
| 25% |
-1.000000 |
| 50% |
-1.000000 |
| 75% |
-1.000000 |
| max |
52.000000 |
Out[22]:
-1 36508
12 81
6 49
4 42
3 37
2 36
10 24
5 24
8 24
1 21
7 20
9 14
11 14
-8 11
0 11
52 7
18 5
40 4
-9 2
26 2
30 1
43 1
36 1
-7 1
dtype: int64
Since the majority of respondents did carry insurance in the previous year, I'll drop the points at -1 to get a sense of the distribution conditional on not being insured.
This is quite interesting. Here's what the documentation has to say about this feature:
For persons who were covered by health insurance on January 1st, it was ascertained if they were ever without health insurance in the previous year (NOINSBEF). The number of weeks/months without health insurance was also ascertained (NOINSTM, NOINUNIT).
I'm interested to know if this feature might be predictive of expenditure. It's hard to say - going a long time without insurance might be reflective of being unemployed, or if the respondent did have the means to buy insurance, may be a proxy for their (lack of) demand for risk mitigation. If either of these correlations hold, this might say something about the effect of either bargaining position, or risk tolerance, on propensity to spend.
Next, the utilization features:
It seems clear that most people use relatively little healthcare. Let's confirm if a lot of these utilizations are at 0.
count 36940.000000
mean 4.454548
std 9.982592
min 0.000000
25% 0.000000
50% 1.000000
75% 5.000000
max 277.000000
Name: #_OFFICE-BASED_PROVIDER_VISITS_13, dtype: float64
count 36940.000000
mean 0.353952
std 2.373837
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 171.000000
Name: #_OUTPATIENT_DEPT_PROVIDER_VISITS_13, dtype: float64
count 36940.000000
mean 0.203303
std 0.633602
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 24.000000
Name: #_EMERGENCY_ROOM_VISITS_13, dtype: float64
count 36940.000000
mean 0.408663
std 3.817236
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 209.000000
Name: #_NIGHTS_IN_HOSP_FOR_DISCHARGES_2013, dtype: float64
That's definitely the case. Let's see what the distributions look like conditional on positive utilization.
These distributions all look very similar; the modal positive utilization level is one visit, and utilization rapidly decays, although there are very heavy tails for office based services and inpatient services (we've noticed this already in the expenditure features for these categories).
The main difference between these categories is the degree of use. For example, the number of people in the modal office visit level is over 50m, for outpatient and ER is about 30m, and for inpatient nights is about 5m.
Similar pictures in this case.
We've already looked at the patterns in expenditure, so this takes us to the end of the continuous variables we want to look at.
The next thing I'd like to do is to take a closer look at the features to check for bivariate relationships. Do specific diagnoses determine higher levels of expenditure? Do people who make more money pay more for healthcare? If there is such a relationship, is it linear? Is the growth faster or slower than linear? Do certain diagnoses, while not individually affecting expenditure much, combine to have a large effect, perhaps larger than the sum of their individual pieces?
Looking more closely at the data will, I hope, help me validate or refute interesting hypotheses like these, and possibly generate many more. The first step is to try and visualize the possible relationships, which will be in the next post.
Click to read and post comments
Dec 22, 2015
In the last post, I managed to construct the MEPS dataset as a csv file. Here's the result, for reference. (Warning: the file is large: about 250MB.)
The next thing to do is to take a look at the dataset, and see if we need to do any more processing. Since we have it written to disk as a csv file, I can import it using pandas' read_csv function. I may have to do a bit of plotting when I explore the dataset, so I'll go ahead and load all the packages I imagine I'll need for this.
Here's the boilerplate:
Let's load the dataset as a pandas dataframe and see what it contains:
/home/soumya/research/insight/insight_project/data
2013_consolidated_frame.csv dsm_data_cleaned_columns.pkl
2013_test_with_dummies.pkl dsm_data.csv
2013_train_with_dummies.pkl dsm_data.pkl
2013_validate_with_dummies.pkl dsm_data_scoped_variables.pkl
2013_with_dummies_.pkl Untitled.ipynb
2013_with_dummies.pkl
First let's take a look at what features we have:
Out[5]:
Index(['DWELLING UNIT ID', ' PERSON NUMBER', ' PERSON ID (DUID + PID)',
' PANEL NUMBER', ' FAMILY ID (STUDENT MERGED IN) - R3/1',
' FAMILY ID (STUDENT MERGED IN) - R4/2',
' FAMILY ID (STUDENT MERGED IN) - R5/3',
' FAMILY ID (STUDENT MERGED IN) - 12/31/13',
' ANNUAL FAMILY IDENTIFIER', ' CPSFAMID',
...
' TOTAL RX-PRV & TRI AMT 13', ' TOTAL RX-OTH COMBINED AMT 13',
' FINAL PERSON WEIGHT 2013', ' FINAL FAMILY WEIGHT 2013',
' POV ADJ FAMILY WGT-CPS FAM ON 12/31/13',
' FINAL SAQ PERSON WEIGHT 2013',
' FINAL DIABETES CARE SUPPLEMENT WEIGHT',
' VARIANCE ESTIMATION STRATUM - 2013',
' VARIANCE ESTIMATION PSU - 2013', ' '],
dtype='object', length=1792)
The first thing I get from this is that I need to learn what these features actually are, so I'll have to look up the MEPS survey methodology or documentation in order to figure out what these features actually are. There are 1792 features for each record, so this may be a bit involved.
The second thing I get out of looking at the column names is that they really do need a bit of cleanup. For example, the first feature is called "DWELLING UNIT ID", but the second is called " PERSON NUMBER" (note the extra space at the beginning). This seems to be a running problem in the column names. There's also a column right at the end with no name.
There are 1792 column names to fix, which is entirely too many to fix by hand.
On the other hand, it looks like the cleanup consists of a couple of simple rules:
Remove the initial space in the feature name
Pop the last entry from the column names
Let's write a function to do the first job:
Let's try this on our dataframe and see if it does the first task.
Out[7]:
|
DWELLING UNIT ID |
PERSON NUMBER |
PERSON ID (DUID + PID) |
PANEL NUMBER |
FAMILY ID (STUDENT MERGED IN) - R3/1 |
FAMILY ID (STUDENT MERGED IN) - R4/2 |
FAMILY ID (STUDENT MERGED IN) - R5/3 |
FAMILY ID (STUDENT MERGED IN) - 12/31/13 |
ANNUAL FAMILY IDENTIFIER |
CPSFAMID |
... |
TOTAL RX-PRV & TRI AMT 13 |
TOTAL RX-OTH COMBINED AMT 13 |
FINAL PERSON WEIGHT 2013 |
FINAL FAMILY WEIGHT 2013 |
POV ADJ FAMILY WGT-CPS FAM ON 12/31/13 |
FINAL SAQ PERSON WEIGHT 2013 |
FINAL DIABETES CARE SUPPLEMENT WEIGHT |
VARIANCE ESTIMATION STRATUM - 2013 |
VARIANCE ESTIMATION PSU - 2013 |
|
| 0 |
20004 |
101 |
20004101 |
17 |
A |
A |
A |
A |
A |
A |
... |
0 |
0 |
6594.801768 |
6263.398994 |
6263.398994 |
7960.062940 |
0 |
1109 |
1 |
NaN |
| 1 |
20004 |
102 |
20004102 |
17 |
A |
A |
A |
A |
A |
A |
... |
0 |
0 |
7144.183132 |
6263.398994 |
6263.398994 |
7971.337575 |
0 |
1109 |
1 |
NaN |
| 2 |
20004 |
103 |
20004103 |
17 |
A |
A |
A |
A |
A |
A |
... |
77 |
0 |
5616.278666 |
6263.398994 |
6263.398994 |
0.000000 |
0 |
1109 |
1 |
NaN |
| 3 |
20005 |
101 |
20005101 |
17 |
A |
A |
A |
A |
A |
A |
... |
0 |
53 |
4045.586998 |
6080.690177 |
6080.690177 |
5604.764494 |
0 |
1109 |
2 |
NaN |
| 4 |
20005 |
102 |
20005102 |
17 |
A |
A |
A |
A |
A |
A |
... |
0 |
0 |
6901.505814 |
6080.690177 |
6080.690177 |
7781.609891 |
0 |
1109 |
2 |
NaN |
5 rows × 1792 columns
So far so good; now I just need to get rid of the last column.
Out[8]:
|
DWELLING UNIT ID |
PERSON NUMBER |
PERSON ID (DUID + PID) |
PANEL NUMBER |
FAMILY ID (STUDENT MERGED IN) - R3/1 |
FAMILY ID (STUDENT MERGED IN) - R4/2 |
FAMILY ID (STUDENT MERGED IN) - R5/3 |
FAMILY ID (STUDENT MERGED IN) - 12/31/13 |
ANNUAL FAMILY IDENTIFIER |
CPSFAMID |
... |
TOT RX-OTH UNCLASS SRCE AMT 13 |
TOTAL RX-PRV & TRI AMT 13 |
TOTAL RX-OTH COMBINED AMT 13 |
FINAL PERSON WEIGHT 2013 |
FINAL FAMILY WEIGHT 2013 |
POV ADJ FAMILY WGT-CPS FAM ON 12/31/13 |
FINAL SAQ PERSON WEIGHT 2013 |
FINAL DIABETES CARE SUPPLEMENT WEIGHT |
VARIANCE ESTIMATION STRATUM - 2013 |
VARIANCE ESTIMATION PSU - 2013 |
| 0 |
20004 |
101 |
20004101 |
17 |
A |
A |
A |
A |
A |
A |
... |
0 |
0 |
0 |
6594.801768 |
6263.398994 |
6263.398994 |
7960.062940 |
0 |
1109 |
1 |
| 1 |
20004 |
102 |
20004102 |
17 |
A |
A |
A |
A |
A |
A |
... |
0 |
0 |
0 |
7144.183132 |
6263.398994 |
6263.398994 |
7971.337575 |
0 |
1109 |
1 |
| 2 |
20004 |
103 |
20004103 |
17 |
A |
A |
A |
A |
A |
A |
... |
0 |
77 |
0 |
5616.278666 |
6263.398994 |
6263.398994 |
0.000000 |
0 |
1109 |
1 |
| 3 |
20005 |
101 |
20005101 |
17 |
A |
A |
A |
A |
A |
A |
... |
0 |
0 |
53 |
4045.586998 |
6080.690177 |
6080.690177 |
5604.764494 |
0 |
1109 |
2 |
| 4 |
20005 |
102 |
20005102 |
17 |
A |
A |
A |
A |
A |
A |
... |
0 |
0 |
0 |
6901.505814 |
6080.690177 |
6080.690177 |
7781.609891 |
0 |
1109 |
2 |
5 rows × 1791 columns
Great, that worked. Alright, let's roll that into the clean_columns function:
Out[11]:
Index(['DWELLING UNIT ID', 'PERSON NUMBER', 'PERSON ID (DUID + PID)',
'PANEL NUMBER', 'FAMILY ID (STUDENT MERGED IN) - R3/1',
'FAMILY ID (STUDENT MERGED IN) - R4/2',
'FAMILY ID (STUDENT MERGED IN) - R5/3',
'FAMILY ID (STUDENT MERGED IN) - 12/31/13', 'ANNUAL FAMILY IDENTIFIER',
'CPSFAMID',
...
'TOT RX-OTH UNCLASS SRCE AMT 13', 'TOTAL RX-PRV & TRI AMT 13',
'TOTAL RX-OTH COMBINED AMT 13', 'FINAL PERSON WEIGHT 2013',
'FINAL FAMILY WEIGHT 2013', 'POV ADJ FAMILY WGT-CPS FAM ON 12/31/13',
'FINAL SAQ PERSON WEIGHT 2013', 'FINAL DIABETES CARE SUPPLEMENT WEIGHT',
'VARIANCE ESTIMATION STRATUM - 2013', 'VARIANCE ESTIMATION PSU - 2013'],
dtype='object', length=1791)
Perfect.
One thing worth remarking on is that I've actually extracted the variable description rather than the variable name in this step. This was a deliberate decision, as I find it much more intuitive to work with variable names like "MOST RECENT COLONOSCOPY (>39) - R5/3" rather than the variable name "CLNTST53". Those of you familiar with the constraints of variable names in SQL databases will object that this is going to cause trouble in the future. I decided to go with the more descriptive variable names, and write a coding-decoding function to help move data in and out of SQL databases a bit later in the project. Overall, I found this to be a reasonable choice.
To this end, let me go ahead and get rid of the spaces in the variables names before that causes any bugs down the road.
Let's run the data columns through this function:
Out[14]:
Index(['DWELLING_UNIT_ID', 'PERSON_NUMBER', 'PERSON_ID_(DUID_+_PID)',
'PANEL_NUMBER', 'FAMILY_ID_(STUDENT_MERGED_IN)_-_R3/1',
'FAMILY_ID_(STUDENT_MERGED_IN)_-_R4/2',
'FAMILY_ID_(STUDENT_MERGED_IN)_-_R5/3',
'FAMILY_ID_(STUDENT_MERGED_IN)_-_12/31/13', 'ANNUAL_FAMILY_IDENTIFIER',
'CPSFAMID',
...
'TOT_RX-OTH_UNCLASS_SRCE_AMT_13', 'TOTAL_RX-PRV_&_TRI_AMT_13',
'TOTAL_RX-OTH_COMBINED_AMT_13', 'FINAL_PERSON_WEIGHT_2013',
'FINAL_FAMILY_WEIGHT_2013', 'POV_ADJ_FAMILY_WGT-CPS_FAM_ON_12/31/13',
'FINAL_SAQ_PERSON_WEIGHT_2013', 'FINAL_DIABETES_CARE_SUPPLEMENT_WEIGHT',
'VARIANCE_ESTIMATION_STRATUM_-_2013', 'VARIANCE_ESTIMATION_PSU_-_2013'],
dtype='object', length=1791)
Next, I'd like to learn a bit more about what these features are. The main thing to keep in mind is that having a large number of features will make our model very unwieldy.
The crucial thing for this model is for it to be as useful as possible, and if I have to collect 1700 features or so about a person before I can give them any useful feedback, this model won't be used by anyone.
Secondly, for the purposes of my project, I'm most concerned with how the demographic, economic, or health status of a consumer drives cost, since these are the things that can inform how we choose to design the market for healthcare services.
To this end, let's take a look at what's available at the MEPS page that points to the dataset. This page gives us two helpful pieces of information: the first is the documentation for the dataset and the second is the codebook, which we've already used in constructing the csv file.
While the codebook contains the entire listing of variables in the dataset and the start-end points in the ASCII file, it does not tell us what values the variables actually take. To be fair, the variable names are fairly descriptive so we might be able to make educated guesses here, but a quick query of one of the columns in the dataset we've constructed:
Out[15]:
count 36940.000000
mean 0.928533
std 1.355903
min -9.000000
25% -1.000000
50% 2.000000
75% 2.000000
max 2.000000
Name: HIGH_CHOLESTEROL_DIAGNOSIS_(>17), dtype: float64
raises the question: what do the values of this variable mean? It has a max of 2, and a min of -9. That doesn't correspond to any prior I currently have over the values of a diagnosis of high cholesterol. I'd expect to see things like positive, negative, not tested, and so on. Let's take a more detailed look:
Out[16]:
2 18967
-1 10382
1 7497
-8 49
-7 24
-9 21
dtype: int64
The most frequently populated values are 2, 1 and -1, but there are also records with values -8, -7 and -9.
This warrants a closer look at the documentation. A quick search for "cholesterol" turns up:
High Cholesterol
Questions about high cholesterol were asked of persons aged 18 or older. Consequently, persons aged 17 or younger were coded as “Inapplicable” (-1) on these variables. CHOLDX ascertained whether the person had ever been diagnosed as having high cholesterol. Through 2007, a person-level variable (CHLAGE) indicated the age of diagnosis for high cholesterol on the Person-Level Use PUF. The age of diagnosis for high cholesterol (CHOLAGED) is included in this file. This variable is top-coded to 85 years of age.
which is somewhat helpful, but doesn't explain the use of the values -8 and -9.
Further digging into the documentation yields the following table:
2.2 Reserved Codes
The following reserved code values are used:
Value Definition
-1 INAPPLICABLE Question was not asked due to skip pattern
-2 DETERMINED IN PREVIOUS ROUND Question was not asked in round because there was no change in current main job since previous round
-7 REFUSED Question was asked and respondent refused to answer question
-8 DK Question was asked and respondent did not know answer
-9 NOT ASCERTAINED Interviewer did not record the data
-10 HOURLY WAGE >= $76.96 Hourly wage was top-coded for confidentiality
-13 INITIAL WAGE IMPUTED Hourly wage was previously imputed so an updated wage is not included in this file
This is helpful; it indicates that the most relevant levels for our analysis ought to be the following three:
- Positive diagnosis of high cholesterol
- Negative diagnosis of high cholesterol
- Inapplicable to respondent
and given the low population in levels below -1, we can either aggregate these data points with level==-1, or discard them entirely. In the following analysis, we choose to discard them.
A final question that needs to be asked at this step is which level corresponds to a positive diagnosis of high cholesterol? A close look at the documentation indicates that level==1 corresponds to having the condition, and level==2 corresponds to not having the condition. Knowing which level is which doesn't particularly affect the computations that generate our models going forward, but it is important for being able to interpret the models.
As a quick sanity check, let's get a visualization of the distribution of the data between these levels:
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7efd06e86860>
Next, instead of using level==1 or level==2 to describe the values that a categorical feature takes, I'd like to translate the levels of these categorical variables into their descriptions so my plots are easier to read.
Here's a dictionary that implements this using the documentation quoted above:
Let's write a function that implements this as necessary for a feature.
The features whose levels I need to interpret in this way are all categorical features; the documentation indicates that the continuous variables aren't generally coded (apart from top-coding to preserve privacy). Accordingly, it would help to be able to automatically determine whether the feature I'm looking at is categorical or continuous. I'll try implementing the following rough rule of thumb: if a feature has less than 10-20 levels, I'll assume it's categorical, and otherwise I'll assume it's continuous.
Let's classify the features in the dataset:
Finally, I'll write a function that interprets the levels of the features in any categorical variable:
Here's an example of how it transforms the levels in a plot:
That visualization was a nice way of learning something about the data, and confirming a hunch we had about how to interpret it. In fact, simple visual plots of different features in the dataset are a crucial way of getting some initial insight into the data, and helping us prioritize the most fruitful approaches for modeling the data.
To fix ideas, let's scope down what we want to go into our model.
Explicitly, we'd like to be able to answer the following questions: which (if any) of the following features about a person determine what they will pay for healthcare in a year?
- Demographic information:
- Age
- Sex
- Race/Ethnicity
- Marital Status
- Educational attainment
- Census region
- Income information (these are all top-coded in the dataset for confidentiality reasons):
- Idiosyncratic health status variables:
- High blood pressure, including multiple diagnoses
- Heart disease (including coronary heart disease, angina, myocardial infarction, and other unspecified heart disease)
- Stroke
- Emphysema
- Chronic Bronchitis
- High cholesterol, including the age of diagnosis
- Cancer
- Diabetes
- Joint pain
- Arthritis
- Asthma
- Attention Deficit Hyperactivity Disorder/Attention Deficit Disorder (ADHD/ADD)
- Adult Body Mass Index
- Health Insurance Variables
- Public Insurance Indicators
- Private Insurance Indicators
- Any Insurance Indicators
- HMO plan
- Utilization, expenditure and source of payments information
It's fair to say that this is a lot of information to process. One thing that stands out here is that there are a great number of features already recorded for us in this dataset; this is in sharp contrast to some projects I've done using text and/or social media data, in which the hard part of the modeling process is feature construction in some appropriate way.
Since there is such a rich variety of features, I imagine most of the feature engineering in this project will amount to bucketing variables and levels in appropriate ways to optimize explained variation, rather than constructing new features from scratch.
To this end, let's make a list of the exact features we want to start working with.
- AGE AS OF 12/31/13 (EDITED/IMPUTED)
- SEX
- RACE/ETHNICITY (EDITED/IMPUTED)
- MARITAL STATUS-12/31/13 (EDITED/IMPUTED)
- FAMILY'S TOTAL INCOME
- ADULT BODY MASS INDEX (>17) - RD 5/3
- HIGH BLOOD PRESSURE DIAG (>17)
- CORONARY HRT DISEASE DIAG (>17)
- AGE OF DIAGNOSIS-HIGH BLOOD PRESSURE
- MULT DIAG HIGH BLOOD PRESS (>17)
- AGE OF DIAGNOSIS-CORONARY HEART DISEASE
- ANGINA DIAGNOSIS (>17)
- AGE OF DIAGNOSIS-ANGINA
- HEART ATTACK (MI) DIAG (>17)
- AGE OF DIAGNOSIS-HEART ATTACK(MI)
- OTHER HEART DISEASE DIAG (>17)
- AGE OF DIAGNOSIS-OTHER HEART DISEASE
- STROKE DIAGNOSIS (>17)
- AGE OF DIAGNOSIS-STROKE
- EMPHYSEMA DIAGNOSIS (>17)
- AGE OF DIAGNOSIS-EMPHYSEMA
- CHRONC BRONCHITS LAST 12 MTHS (>17)-R3/1
- CHRONC BRONCHITS LAST 12 MTHS (>17)-R5/3
- HIGH CHOLESTEROL DIAGNOSIS (>17)
- AGE OF DIAGNOSIS-HIGH CHOLESTEROL
- CANCER DIAGNOSIS (>17)
- CANCER DIAGNOSED - BLADDER (>17)
- CANCER DIAGNOSED - BREAST (>17)
- CANCER DIAGNOSED - CERVIX (>17)
- CANCER DIAGNOSED - COLON (>17)
- CANCER DIAGNOSED - LUNG (>17)
- CANCER DIAGNOSED - LYMPHOMA (>17)
- CANCER DIAGNOSED - MELANOMA (>17)
- CANCER DIAGNOSED - OTHER (>17)
- CANCER DIAGNOSED - PROSTATE (>17)
- CANCER DIAGNOSED - SKIN-NONMELANO (>17)
- CANCER DIAGNOSED-SKIN-UNKNOWN TYPE (>17)
- CANCER DIAGNOSED - UTERUS (>17)
- DIABETES DIAGNOSIS (>17)
- AGE OF DIAGNOSIS-DIABETES
- JOINT PAIN LAST 12 MONTHS (>17) - RD 3/1
- JOINT PAIN LAST 12 MONTHS (>17) - RD 5/3
- ARTHRITIS DIAGNOSIS (>17)
- AGE OF DIAGNOSIS-ARTHRITIS
- ASTHMA DIAGNOSIS
- AGE OF DIAGNOSIS-ASTHMA
- DOES PERSON STILL HAVE ASTHMA-RD3/1
- DOES PERSON STILL HAVE ASTHMA - RD 5/3
- ASTHMA ATTACK LAST 12 MOS - RD3/1
- USED ACUTE PRES INHALER LAST 3 MOS-RD5/3
- USED>3ACUTE CN PRES INH LAST 3 MOS-RD5/3
- EVER USED PREV DAILY ASTHMA MEDS -RD 5/3
- NOW TAKE PREV DAILY ASTHMA MEDS - RD 5/3
- HAVE PEAK FLOW METER AT HOME - RD 5/3
- EVER USED PEAK FLOW METER - RD 5/3
- ADHDADD DIAGNOSIS (5-17)
- AGE OF DIAGNOSIS-ADHD/ADD
- PREGNANT DURING REF PERIOD - RD 3/1
- LIMITATION IN PHYSICAL FUNCTIONING-RD3/1
- EMPLOYMENT STATUS RD 3/1
- HAS MORE THAN ONE JOB RD 3/1 INT DATE
- SELF-EMPLOYED AT RD 3/1 CMJ
- CHOICE OF HEALTH PLANS AT RD 3/1 CMJ
- INDUSTRY GROUP RD 3/1 CMJ
- UNION STATUS AT RD 3/1 CMJ
- OCCUPATION GROUP RD 3/1 CMJ
- HEALTH INSUR HELD FROM RD 3/1 CMJ (ED)
- HEALTH INSUR OFFERED BY RD 3/1 CMJ (ED)
- EMPLOYER OFFERS HEALTH INS RD 3/1 CMJ
- FULL YEAR INSURANCE COVERAGE STATUS 2013
- ANYONE IN RU HAVE FSA - R3/1
- # WKS/MON WOUT HLTH INS PRV YR-PN 18 ONL
- PRIVATE INSURANCE ANY TIME IN R5/R3
- PUBLIC INS ANY TIME IN R5/R3
- INSURED ANY TIME IN R3/1
- ANY TIME COVERAGE BY STATE INS - R3/1
- # OFFICE-BASED PROVIDER VISITS 13
- # OUTPATIENT DEPT PROVIDER VISITS 13
- # EMERGENCY ROOM VISITS 13
- # NIGHTS IN HOSP FOR DISCHARGES 2013
- FINAL PERSON WEIGHT 2013
- TOTAL OFFICE-BASED EXP 13
- TOTAL OUTPATIENT PROVIDER EXP 13
- TOT HOSP IP FACILITY + DR EXP 13
- TOTAL ER FACILITY + DR EXP 13
- CENSUS REGION AS OF 12/31/13
- EDUCATION RECODE (EDITED)
It's going to be extremely interesting to see if these have explanatory power for the statistical variation in spending on healthcare!
Speaking of which, what are the variables we're trying to predict?
- TOTAL OFFICE-BASED EXP 13
- TOTAL OUTPATIENT PROVIDER EXP 13
- TOT HOSP IP FACILITY + DR EXP 13
- TOTAL ER FACILITY + DR EXP 13
The dataset breaks out healthcare expenses into these four location based categories: office, outpatient, inpatient and ER.
Another modeling choice that needs to be made is whether to try and predict total expenses, or predict each category of expense individually. In my Insight project, I opted to try and predict them individually, to see if I could shed some light on whether the service location affected prices charged. There's reason to think this may be the case: ER sticker shock is widely reported in the news, more so than office visits.
Finally, it's time to reflect a bit on what we've done here. We've isolated explanatory variables that we hope and expect should be predictive of how much a person is charged for healthcare services in four different categories. That's great, but there's a very large number of predictors: over 80. Since the ultimate goal is to build a usable interface to the model that people can use to get a sense of their expected outlays on healthcare, it makes sense to refine and reduce this list. But how?
One way to think about this is to consider what information people are likely to be comfortable sharing about themselves that could help to predict expenses. Another is to focus on what the most predictive variables are, and then ask users to populate these variables in order to get a prediction.
In my Insight project, I opted to balance both these concerns by ranking the explanatory variables in terms of statistical variation explained, cutting off at a reasonable threshold (somewhere in between 10 and 20 variables, corresponding to 10-20 questions asked of the user), and also culling variables that would be particularly burdensome to the user to have to report.
The next thing to think about is: now that we have explanatory variables and explained variables selected, how do we best go about building a model?
My next step is to examine the data feature by feature, to get a better sense of what it looks like and to pick up hunches for what may explain the variation. Along the way I'm sure I'll develop some more detailed hypotheses about healthcare spending, and discover that some of what I implicitly assume to be true about the healthcare market isn't actually so.
First, let's look at the explained variables:
Out[25]:
count 36940.000000
mean 884.442799
std 3290.683088
min 0.000000
25% 0.000000
50% 156.000000
75% 653.000000
max 195295.000000
Name: TOTAL_OFFICE-BASED_EXP_13, dtype: float64
This is interesting. It looks like upto half of respondents spend less than $200 for office-based services in a year. The maximum though, is concerning: over a hundred thousand dollars? Let's take a closer look at that.
Out[26]:
|
DWELLING_UNIT_ID |
PERSON_NUMBER |
PERSON_ID_(DUID_+_PID) |
PANEL_NUMBER |
FAMILY_ID_(STUDENT_MERGED_IN)_-_R3/1 |
FAMILY_ID_(STUDENT_MERGED_IN)_-_R4/2 |
FAMILY_ID_(STUDENT_MERGED_IN)_-_R5/3 |
FAMILY_ID_(STUDENT_MERGED_IN)_-_12/31/13 |
ANNUAL_FAMILY_IDENTIFIER |
CPSFAMID |
... |
TOT_RX-OTH_UNCLASS_SRCE_AMT_13 |
TOTAL_RX-PRV_&_TRI_AMT_13 |
TOTAL_RX-OTH_COMBINED_AMT_13 |
FINAL_PERSON_WEIGHT_2013 |
FINAL_FAMILY_WEIGHT_2013 |
POV_ADJ_FAMILY_WGT-CPS_FAM_ON_12/31/13 |
FINAL_SAQ_PERSON_WEIGHT_2013 |
FINAL_DIABETES_CARE_SUPPLEMENT_WEIGHT |
VARIANCE_ESTIMATION_STRATUM_-_2013 |
VARIANCE_ESTIMATION_PSU_-_2013 |
| 2384 |
21174 |
101 |
21174101 |
17 |
A |
A |
A |
A |
A |
A |
... |
0 |
0 |
0 |
10189.789734 |
8764.276613 |
8764.276613 |
11832.982213 |
13517.344724 |
1007 |
1 |
| 2834 |
21386 |
102 |
21386102 |
17 |
A |
A |
A |
A |
A |
A |
... |
0 |
16487 |
0 |
22234.388648 |
26645.022692 |
26645.022692 |
24377.214141 |
26657.263092 |
1065 |
1 |
| 15019 |
27647 |
105 |
27647105 |
17 |
A |
A |
A |
A |
A |
A |
... |
0 |
173 |
0 |
16134.188642 |
6000.045922 |
6000.045922 |
0.000000 |
0.000000 |
1149 |
2 |
| 15659 |
27976 |
101 |
27976101 |
17 |
A |
A |
A |
A |
A |
A |
... |
0 |
0 |
0 |
15650.765936 |
12480.521290 |
12480.521290 |
14907.860615 |
16293.073976 |
1140 |
1 |
4 rows × 1791 columns
There are 4 people with office based expenses of over 100k. I'd like to take a look to see if these records are bad data, or if they should be kept in the dataset. My prior is that they should be kept in the dataset (the MEPS survey has been running for over 15 years by people who I can only assume know what they're doing). But it doesn't hurt to take a closer look.
Now, I don't want to look at all the extra features in the dataset beyond those that I've decided I care about, so let's go about trimming the dataframe down to contain just these features.
As a matter of good practice, let's write the current dataset to a file, in case we break something and need to come back to this checkpoint.
I've manually written the explanatory variables to a text file located at "/home/soumya/research/insight/insight_project/modeling/code/dsm_exog_.txt". I'm going to read this into memory:
['AGE_AS_OF_12/31/13_(EDITED/IMPUTED)', 'SEX', 'RACE/ETHNICITY_(EDITED/IMPUTED)', 'MARITAL_STATUS-12/31/13_(EDITED/IMPUTED)', "FAMILY'S_TOTAL_INCOME", 'ADULT_BODY_MASS_INDEX_(>17)_-_RD_5/3', 'HIGH_BLOOD_PRESSURE_DIAG_(>17)', 'CORONARY_HRT_DISEASE_DIAG_(>17)', 'AGE_OF_DIAGNOSIS-HIGH_BLOOD_PRESSURE', 'MULT_DIAG_HIGH_BLOOD_PRESS_(>17)', 'AGE_OF_DIAGNOSIS-CORONARY_HEART_DISEASE', 'ANGINA_DIAGNOSIS_(>17)', 'AGE_OF_DIAGNOSIS-ANGINA', 'HEART_ATTACK_(MI)_DIAG_(>17)', 'AGE_OF_DIAGNOSIS-HEART_ATTACK(MI)', 'OTHER_HEART_DISEASE_DIAG_(>17)', 'AGE_OF_DIAGNOSIS-OTHER_HEART_DISEASE', 'STROKE_DIAGNOSIS_(>17)', 'AGE_OF_DIAGNOSIS-STROKE', 'EMPHYSEMA_DIAGNOSIS_(>17)', 'AGE_OF_DIAGNOSIS-EMPHYSEMA', 'CHRONC_BRONCHITS_LAST_12_MTHS_(>17)-R3/1', 'CHRONC_BRONCHITS_LAST_12_MTHS_(>17)-R5/3', 'HIGH_CHOLESTEROL_DIAGNOSIS_(>17)', 'AGE_OF_DIAGNOSIS-HIGH_CHOLESTEROL', 'CANCER_DIAGNOSIS_(>17)', 'CANCER_DIAGNOSED_-_BLADDER_(>17)', 'CANCER_DIAGNOSED_-_BREAST_(>17)', 'CANCER_DIAGNOSED_-_CERVIX_(>17)', 'CANCER_DIAGNOSED_-_COLON_(>17)', 'CANCER_DIAGNOSED_-_LUNG_(>17)', 'CANCER_DIAGNOSED_-_LYMPHOMA_(>17)', 'CANCER_DIAGNOSED_-_MELANOMA_(>17)', 'CANCER_DIAGNOSED_-_OTHER_(>17)', 'CANCER_DIAGNOSED_-_PROSTATE_(>17)', 'CANCER_DIAGNOSED_-_SKIN-NONMELANO_(>17)', 'CANCER_DIAGNOSED-SKIN-UNKNOWN_TYPE_(>17)', 'CANCER_DIAGNOSED_-_UTERUS_(>17)', 'DIABETES_DIAGNOSIS_(>17)', 'AGE_OF_DIAGNOSIS-DIABETES', 'JOINT_PAIN_LAST_12_MONTHS_(>17)_-_RD_3/1', 'JOINT_PAIN_LAST_12_MONTHS_(>17)_-_RD_5/3', 'ARTHRITIS_DIAGNOSIS_(>17)', 'AGE_OF_DIAGNOSIS-ARTHRITIS', 'ASTHMA_DIAGNOSIS', 'AGE_OF_DIAGNOSIS-ASTHMA', 'DOES_PERSON_STILL_HAVE_ASTHMA-RD3/1', 'DOES_PERSON_STILL_HAVE_ASTHMA_-_RD_5/3', 'ASTHMA_ATTACK_LAST_12_MOS_-_RD3/1', 'USED_ACUTE_PRES_INHALER_LAST_3_MOS-RD5/3', 'USED>3ACUTE_CN_PRES_INH_LAST_3_MOS-RD5/3', 'EVER_USED_PREV_DAILY_ASTHMA_MEDS_-RD_5/3', 'NOW_TAKE_PREV_DAILY_ASTHMA_MEDS_-_RD_5/3', 'HAVE_PEAK_FLOW_METER_AT_HOME_-_RD_5/3', 'EVER_USED_PEAK_FLOW_METER_-_RD_5/3', 'ADHDADD_DIAGNOSIS_(5-17)', 'AGE_OF_DIAGNOSIS-ADHD/ADD', 'PREGNANT_DURING_REF_PERIOD_-_RD_3/1', 'LIMITATION_IN_PHYSICAL_FUNCTIONING-RD3/1', 'EMPLOYMENT_STATUS_RD_3/1', 'HAS_MORE_THAN_ONE_JOB_RD_3/1_INT_DATE', 'SELF-EMPLOYED_AT_RD_3/1_CMJ', 'CHOICE_OF_HEALTH_PLANS_AT_RD_3/1_CMJ', 'INDUSTRY_GROUP_RD_3/1_CMJ', 'UNION_STATUS_AT_RD_3/1_CMJ', 'OCCUPATION_GROUP_RD_3/1_CMJ', 'HEALTH_INSUR_HELD_FROM_RD_3/1_CMJ_(ED)', 'HEALTH_INSUR_OFFERED_BY_RD_3/1_CMJ_(ED)', 'EMPLOYER_OFFERS_HEALTH_INS_RD_3/1_CMJ', 'FULL_YEAR_INSURANCE_COVERAGE_STATUS_2013', 'ANYONE_IN_RU_HAVE_FSA_-_R3/1', '#_WKS/MON_WOUT_HLTH_INS_PRV_YR-PN_18_ONL', 'PRIVATE_INSURANCE_ANY_TIME_IN_R5/R3', 'PUBLIC_INS_ANY_TIME_IN_R5/R3', 'INSURED_ANY_TIME_IN_R3/1', 'ANY_TIME_COVERAGE_BY_STATE_INS_-_R3/1', '#_OFFICE-BASED_PROVIDER_VISITS_13', '#_OUTPATIENT_DEPT_PROVIDER_VISITS_13', '#_EMERGENCY_ROOM_VISITS_13', '#_NIGHTS_IN_HOSP_FOR_DISCHARGES_2013', 'FINAL_PERSON_WEIGHT_2013', 'TOTAL_OFFICE-BASED_EXP_13', 'TOTAL_OUTPATIENT_PROVIDER_EXP_13', 'TOT_HOSP_IP_FACILITY_+_DR_EXP_13', 'TOTAL_ER_FACILITY_+_DR_EXP_13', 'CENSUS_REGION_AS_OF_12/31/13', 'EDUCATION_RECODE_(EDITED)']
I'm going to make sure there are no duplicates in my list:
There were 87 variables to start.
There were 87 variables at the end.
For future reference, let's write this dataset to disk so we don't have to do all this over again.
Out[32]:
|
AGE_AS_OF_12/31/13_(EDITED/IMPUTED) |
SEX |
RACE/ETHNICITY_(EDITED/IMPUTED) |
MARITAL_STATUS-12/31/13_(EDITED/IMPUTED) |
FAMILY'S_TOTAL_INCOME |
ADULT_BODY_MASS_INDEX_(>17)_-_RD_5/3 |
HIGH_BLOOD_PRESSURE_DIAG_(>17) |
CORONARY_HRT_DISEASE_DIAG_(>17) |
AGE_OF_DIAGNOSIS-HIGH_BLOOD_PRESSURE |
MULT_DIAG_HIGH_BLOOD_PRESS_(>17) |
... |
#_OUTPATIENT_DEPT_PROVIDER_VISITS_13 |
#_EMERGENCY_ROOM_VISITS_13 |
#_NIGHTS_IN_HOSP_FOR_DISCHARGES_2013 |
FINAL_PERSON_WEIGHT_2013 |
TOTAL_OFFICE-BASED_EXP_13 |
TOTAL_OUTPATIENT_PROVIDER_EXP_13 |
TOT_HOSP_IP_FACILITY_+_DR_EXP_13 |
TOTAL_ER_FACILITY_+_DR_EXP_13 |
CENSUS_REGION_AS_OF_12/31/13 |
EDUCATION_RECODE_(EDITED) |
| 0 |
39 |
1 |
1 |
1 |
85728 |
35.5 |
2 |
2 |
-1 |
-1 |
... |
0 |
0 |
0 |
6594.801768 |
428 |
0 |
0 |
0 |
3 |
14 |
| 1 |
40 |
2 |
1 |
1 |
85728 |
28.5 |
2 |
2 |
-1 |
-1 |
... |
0 |
0 |
0 |
7144.183132 |
0 |
0 |
0 |
0 |
3 |
13 |
| 2 |
10 |
2 |
1 |
6 |
85728 |
-1.0 |
-1 |
-1 |
-1 |
-1 |
... |
2 |
0 |
0 |
5616.278666 |
0 |
108 |
0 |
0 |
3 |
1 |
| 3 |
52 |
1 |
1 |
4 |
20142 |
26.6 |
1 |
2 |
41 |
2 |
... |
0 |
0 |
0 |
4045.586998 |
674 |
0 |
0 |
0 |
1 |
14 |
| 4 |
22 |
1 |
1 |
5 |
20142 |
22.9 |
2 |
2 |
-1 |
-1 |
... |
0 |
0 |
0 |
6901.505814 |
0 |
0 |
0 |
0 |
1 |
14 |
5 rows × 87 columns
Great! Now let's look at those four data points more closely. N.B: Note the change of assignment to the name "data".
Out[33]:
|
#_OFFICE-BASED_PROVIDER_VISITS_13 |
TOTAL_OFFICE-BASED_EXP_13 |
| 2384 |
168 |
109403 |
| 2834 |
164 |
195295 |
| 15019 |
51 |
137891 |
| 15659 |
128 |
183384 |
Ok, if you look at the feature called "# OFFICE-BASED PROVIDER VISITS 13", it looks like these people had 168, 164, 51 and 128 visits each. While this certainly seems anomalous, it's consistent with the high values for amounts billed for office-based expenses.
At this point it's unclear whether these points should be left in the dataset, or thrown out. On the one hand, one would hope that most people don't need over 160 visits to an office-based provider in a year. On the other hand, it's a well-documented fact that a small handful of very ill people account for the vast majority of healthcare costs in the U.S. every year (this is well known to third-party payors). This may make it important to keep in these outliers in the final model.
I'd like to take a closer look at these points. What was the average charge per visit for these respondents?
Out[34]:
2384 651.208333
2834 1190.823171
15019 2703.745098
15659 1432.687500
dtype: float64
Between $\$650$ and $\$2700$. Presumably these were specialists, and expensive.
If these amounts are not fraudulent, I can imagine that it would be important to leave them in while modeling, to help account for the risk of large healthcare bills. Since there are only four respondents with such high levels of expenditure, let's take a closer look to see if we can find out what drove their spending.
Out[35]:
|
AGE_AS_OF_12/31/13_(EDITED/IMPUTED) |
SEX |
RACE/ETHNICITY_(EDITED/IMPUTED) |
MARITAL_STATUS-12/31/13_(EDITED/IMPUTED) |
FAMILY'S_TOTAL_INCOME |
ADULT_BODY_MASS_INDEX_(>17)_-_RD_5/3 |
HIGH_BLOOD_PRESSURE_DIAG_(>17) |
CORONARY_HRT_DISEASE_DIAG_(>17) |
AGE_OF_DIAGNOSIS-HIGH_BLOOD_PRESSURE |
MULT_DIAG_HIGH_BLOOD_PRESS_(>17) |
... |
#_OUTPATIENT_DEPT_PROVIDER_VISITS_13 |
#_EMERGENCY_ROOM_VISITS_13 |
#_NIGHTS_IN_HOSP_FOR_DISCHARGES_2013 |
FINAL_PERSON_WEIGHT_2013 |
TOTAL_OFFICE-BASED_EXP_13 |
TOTAL_OUTPATIENT_PROVIDER_EXP_13 |
TOT_HOSP_IP_FACILITY_+_DR_EXP_13 |
TOTAL_ER_FACILITY_+_DR_EXP_13 |
CENSUS_REGION_AS_OF_12/31/13 |
EDUCATION_RECODE_(EDITED) |
| 2384 |
54 |
1 |
2 |
3 |
8652 |
35.0 |
1 |
2 |
50 |
2 |
... |
2 |
3 |
0 |
10189.789734 |
109403 |
1122 |
0 |
1878 |
3 |
2 |
| 2834 |
67 |
2 |
2 |
1 |
126230 |
26.3 |
1 |
1 |
35 |
1 |
... |
0 |
0 |
17 |
22234.388648 |
195295 |
0 |
74030 |
0 |
4 |
14 |
| 15019 |
10 |
1 |
3 |
6 |
73568 |
-1.0 |
-1 |
-1 |
-1 |
-1 |
... |
11 |
3 |
26 |
16134.188642 |
137891 |
3444 |
59615 |
1163 |
3 |
1 |
| 15659 |
68 |
2 |
5 |
3 |
31349 |
27.8 |
1 |
2 |
52 |
1 |
... |
9 |
3 |
7 |
15650.765936 |
183384 |
484 |
10670 |
1600 |
1 |
14 |
4 rows × 87 columns
I'll take a look at these respondents one by one:
AGE_AS_OF_12/31/13_(EDITED/IMPUTED) 54
SEX MALE
RACE/ETHNICITY_(EDITED/IMPUTED) NON-HISPANIC WHITE ONLY
MARITAL_STATUS-12/31/13_(EDITED/IMPUTED) DIVORCED
FAMILY'S_TOTAL_INCOME 8652
ADULT_BODY_MASS_INDEX_(>17)_-_RD_5/3 35.0
HIGH_BLOOD_PRESSURE_DIAG_(>17) Yes
CORONARY_HRT_DISEASE_DIAG_(>17) No
AGE_OF_DIAGNOSIS-HIGH_BLOOD_PRESSURE 50
MULT_DIAG_HIGH_BLOOD_PRESS_(>17) No
AGE_OF_DIAGNOSIS-CORONARY_HEART_DISEASE inapplicable
ANGINA_DIAGNOSIS_(>17) No
AGE_OF_DIAGNOSIS-ANGINA inapplicable
HEART_ATTACK_(MI)_DIAG_(>17) No
AGE_OF_DIAGNOSIS-HEART_ATTACK(MI) inapplicable
OTHER_HEART_DISEASE_DIAG_(>17) No
AGE_OF_DIAGNOSIS-OTHER_HEART_DISEASE inapplicable
STROKE_DIAGNOSIS_(>17) No
AGE_OF_DIAGNOSIS-STROKE inapplicable
EMPHYSEMA_DIAGNOSIS_(>17) Yes
AGE_OF_DIAGNOSIS-EMPHYSEMA 45
CHRONC_BRONCHITS_LAST_12_MTHS_(>17)-R3/1 No
CHRONC_BRONCHITS_LAST_12_MTHS_(>17)-R5/3 No
HIGH_CHOLESTEROL_DIAGNOSIS_(>17) No
AGE_OF_DIAGNOSIS-HIGH_CHOLESTEROL inapplicable
CANCER_DIAGNOSIS_(>17) No
CANCER_DIAGNOSED_-_BLADDER_(>17) Inapplicable
CANCER_DIAGNOSED_-_BREAST_(>17) Inapplicable
CANCER_DIAGNOSED_-_CERVIX_(>17) Inapplicable
CANCER_DIAGNOSED_-_COLON_(>17) Inapplicable
CANCER_DIAGNOSED_-_LUNG_(>17) Inapplicable
CANCER_DIAGNOSED_-_LYMPHOMA_(>17) Inapplicable
CANCER_DIAGNOSED_-_MELANOMA_(>17) Inapplicable
CANCER_DIAGNOSED_-_OTHER_(>17) Inapplicable
CANCER_DIAGNOSED_-_PROSTATE_(>17) Inapplicable
CANCER_DIAGNOSED_-_SKIN-NONMELANO_(>17) Inapplicable
CANCER_DIAGNOSED-SKIN-UNKNOWN_TYPE_(>17) Inapplicable
CANCER_DIAGNOSED_-_UTERUS_(>17) Inapplicable
DIABETES_DIAGNOSIS_(>17) Yes
AGE_OF_DIAGNOSIS-DIABETES 53
JOINT_PAIN_LAST_12_MONTHS_(>17)_-_RD_3/1 Yes
JOINT_PAIN_LAST_12_MONTHS_(>17)_-_RD_5/3 Yes
ARTHRITIS_DIAGNOSIS_(>17) No
AGE_OF_DIAGNOSIS-ARTHRITIS inapplicable
ASTHMA_DIAGNOSIS No
AGE_OF_DIAGNOSIS-ASTHMA inapplicable
DOES_PERSON_STILL_HAVE_ASTHMA-RD3/1 Inapplicable
DOES_PERSON_STILL_HAVE_ASTHMA_-_RD_5/3 Inapplicable
ASTHMA_ATTACK_LAST_12_MOS_-_RD3/1 Inapplicable
USED_ACUTE_PRES_INHALER_LAST_3_MOS-RD5/3 Inapplicable
USED>3ACUTE_CN_PRES_INH_LAST_3_MOS-RD5/3 Inapplicable
EVER_USED_PREV_DAILY_ASTHMA_MEDS_-RD_5/3 Inapplicable
NOW_TAKE_PREV_DAILY_ASTHMA_MEDS_-_RD_5/3 Inapplicable
HAVE_PEAK_FLOW_METER_AT_HOME_-_RD_5/3 Inapplicable
EVER_USED_PEAK_FLOW_METER_-_RD_5/3 Inapplicable
ADHDADD_DIAGNOSIS_(5-17) Inapplicable
AGE_OF_DIAGNOSIS-ADHD/ADD inapplicable
PREGNANT_DURING_REF_PERIOD_-_RD_3/1 Inapplicable
LIMITATION_IN_PHYSICAL_FUNCTIONING-RD3/1 Yes
EMPLOYMENT_STATUS_RD_3/1 NOT EMPLOYED DURING RD 3/1
HAS_MORE_THAN_ONE_JOB_RD_3/1_INT_DATE Inapplicable
SELF-EMPLOYED_AT_RD_3/1_CMJ Inapplicable
CHOICE_OF_HEALTH_PLANS_AT_RD_3/1_CMJ Inapplicable
INDUSTRY_GROUP_RD_3/1_CMJ INAPPLICABLE
UNION_STATUS_AT_RD_3/1_CMJ Inapplicable
OCCUPATION_GROUP_RD_3/1_CMJ INAPPLICABLE
HEALTH_INSUR_HELD_FROM_RD_3/1_CMJ_(ED) Inapplicable
HEALTH_INSUR_OFFERED_BY_RD_3/1_CMJ_(ED) Inapplicable
EMPLOYER_OFFERS_HEALTH_INS_RD_3/1_CMJ Inapplicable
FULL_YEAR_INSURANCE_COVERAGE_STATUS_2013 <65 ANY PRIVATE
ANYONE_IN_RU_HAVE_FSA_-_R3/1 No
#_WKS/MON_WOUT_HLTH_INS_PRV_YR-PN_18_ONL inapplicable
PRIVATE_INSURANCE_ANY_TIME_IN_R5/R3 Yes
PUBLIC_INS_ANY_TIME_IN_R5/R3 Yes
INSURED_ANY_TIME_IN_R3/1 Yes
ANY_TIME_COVERAGE_BY_STATE_INS_-_R3/1 No
#_OFFICE-BASED_PROVIDER_VISITS_13 168
#_OUTPATIENT_DEPT_PROVIDER_VISITS_13 2
#_EMERGENCY_ROOM_VISITS_13 3
#_NIGHTS_IN_HOSP_FOR_DISCHARGES_2013 0
FINAL_PERSON_WEIGHT_2013 10189.789734
TOTAL_OFFICE-BASED_EXP_13 109403
TOTAL_OUTPATIENT_PROVIDER_EXP_13 1122
TOT_HOSP_IP_FACILITY_+_DR_EXP_13 0
TOTAL_ER_FACILITY_+_DR_EXP_13 1878
CENSUS_REGION_AS_OF_12/31/13 South
EDUCATION_RECODE_(EDITED) 9 - 12TH GRADE, NO HS DIPLOMA OR GED
AGE_AS_OF_12/31/13_(EDITED/IMPUTED) 67
SEX FEMALE
RACE/ETHNICITY_(EDITED/IMPUTED) NON-HISPANIC WHITE ONLY
MARITAL_STATUS-12/31/13_(EDITED/IMPUTED) MARRIED
FAMILY'S_TOTAL_INCOME 126230
ADULT_BODY_MASS_INDEX_(>17)_-_RD_5/3 26.3
HIGH_BLOOD_PRESSURE_DIAG_(>17) Yes
CORONARY_HRT_DISEASE_DIAG_(>17) Yes
AGE_OF_DIAGNOSIS-HIGH_BLOOD_PRESSURE 35
MULT_DIAG_HIGH_BLOOD_PRESS_(>17) Yes
AGE_OF_DIAGNOSIS-CORONARY_HEART_DISEASE 44
ANGINA_DIAGNOSIS_(>17) Yes
AGE_OF_DIAGNOSIS-ANGINA 44
HEART_ATTACK_(MI)_DIAG_(>17) No
AGE_OF_DIAGNOSIS-HEART_ATTACK(MI) inapplicable
OTHER_HEART_DISEASE_DIAG_(>17) Yes
AGE_OF_DIAGNOSIS-OTHER_HEART_DISEASE 63
STROKE_DIAGNOSIS_(>17) Yes
AGE_OF_DIAGNOSIS-STROKE 66
EMPHYSEMA_DIAGNOSIS_(>17) No
AGE_OF_DIAGNOSIS-EMPHYSEMA inapplicable
CHRONC_BRONCHITS_LAST_12_MTHS_(>17)-R3/1 No
CHRONC_BRONCHITS_LAST_12_MTHS_(>17)-R5/3 No
HIGH_CHOLESTEROL_DIAGNOSIS_(>17) No
AGE_OF_DIAGNOSIS-HIGH_CHOLESTEROL inapplicable
CANCER_DIAGNOSIS_(>17) No
CANCER_DIAGNOSED_-_BLADDER_(>17) Inapplicable
CANCER_DIAGNOSED_-_BREAST_(>17) Inapplicable
CANCER_DIAGNOSED_-_CERVIX_(>17) Inapplicable
CANCER_DIAGNOSED_-_COLON_(>17) Inapplicable
CANCER_DIAGNOSED_-_LUNG_(>17) Inapplicable
CANCER_DIAGNOSED_-_LYMPHOMA_(>17) Inapplicable
CANCER_DIAGNOSED_-_MELANOMA_(>17) Inapplicable
CANCER_DIAGNOSED_-_OTHER_(>17) Inapplicable
CANCER_DIAGNOSED_-_PROSTATE_(>17) Inapplicable
CANCER_DIAGNOSED_-_SKIN-NONMELANO_(>17) Inapplicable
CANCER_DIAGNOSED-SKIN-UNKNOWN_TYPE_(>17) Inapplicable
CANCER_DIAGNOSED_-_UTERUS_(>17) Inapplicable
DIABETES_DIAGNOSIS_(>17) Yes
AGE_OF_DIAGNOSIS-DIABETES 38
JOINT_PAIN_LAST_12_MONTHS_(>17)_-_RD_3/1 No
JOINT_PAIN_LAST_12_MONTHS_(>17)_-_RD_5/3 No
ARTHRITIS_DIAGNOSIS_(>17) Yes
AGE_OF_DIAGNOSIS-ARTHRITIS 60
ASTHMA_DIAGNOSIS No
AGE_OF_DIAGNOSIS-ASTHMA inapplicable
DOES_PERSON_STILL_HAVE_ASTHMA-RD3/1 Inapplicable
DOES_PERSON_STILL_HAVE_ASTHMA_-_RD_5/3 Inapplicable
ASTHMA_ATTACK_LAST_12_MOS_-_RD3/1 Inapplicable
USED_ACUTE_PRES_INHALER_LAST_3_MOS-RD5/3 Inapplicable
USED>3ACUTE_CN_PRES_INH_LAST_3_MOS-RD5/3 Inapplicable
EVER_USED_PREV_DAILY_ASTHMA_MEDS_-RD_5/3 Inapplicable
NOW_TAKE_PREV_DAILY_ASTHMA_MEDS_-_RD_5/3 Inapplicable
HAVE_PEAK_FLOW_METER_AT_HOME_-_RD_5/3 Inapplicable
EVER_USED_PEAK_FLOW_METER_-_RD_5/3 Inapplicable
ADHDADD_DIAGNOSIS_(5-17) Inapplicable
AGE_OF_DIAGNOSIS-ADHD/ADD inapplicable
PREGNANT_DURING_REF_PERIOD_-_RD_3/1 Inapplicable
LIMITATION_IN_PHYSICAL_FUNCTIONING-RD3/1 Yes
EMPLOYMENT_STATUS_RD_3/1 NOT EMPLOYED DURING RD 3/1
HAS_MORE_THAN_ONE_JOB_RD_3/1_INT_DATE Inapplicable
SELF-EMPLOYED_AT_RD_3/1_CMJ Inapplicable
CHOICE_OF_HEALTH_PLANS_AT_RD_3/1_CMJ Inapplicable
INDUSTRY_GROUP_RD_3/1_CMJ INAPPLICABLE
UNION_STATUS_AT_RD_3/1_CMJ Inapplicable
OCCUPATION_GROUP_RD_3/1_CMJ INAPPLICABLE
HEALTH_INSUR_HELD_FROM_RD_3/1_CMJ_(ED) Inapplicable
HEALTH_INSUR_OFFERED_BY_RD_3/1_CMJ_(ED) Inapplicable
EMPLOYER_OFFERS_HEALTH_INS_RD_3/1_CMJ Inapplicable
FULL_YEAR_INSURANCE_COVERAGE_STATUS_2013 65+ EDITED MEDICARE AND PRIVATE
ANYONE_IN_RU_HAVE_FSA_-_R3/1 No
#_WKS/MON_WOUT_HLTH_INS_PRV_YR-PN_18_ONL inapplicable
PRIVATE_INSURANCE_ANY_TIME_IN_R5/R3 Yes
PUBLIC_INS_ANY_TIME_IN_R5/R3 Yes
INSURED_ANY_TIME_IN_R3/1 Yes
ANY_TIME_COVERAGE_BY_STATE_INS_-_R3/1 No
#_OFFICE-BASED_PROVIDER_VISITS_13 164
#_OUTPATIENT_DEPT_PROVIDER_VISITS_13 0
#_EMERGENCY_ROOM_VISITS_13 0
#_NIGHTS_IN_HOSP_FOR_DISCHARGES_2013 17
FINAL_PERSON_WEIGHT_2013 22234.388648
TOTAL_OFFICE-BASED_EXP_13 195295
TOTAL_OUTPATIENT_PROVIDER_EXP_13 0
TOT_HOSP_IP_FACILITY_+_DR_EXP_13 74030
TOTAL_ER_FACILITY_+_DR_EXP_13 0
CENSUS_REGION_AS_OF_12/31/13 West
EDUCATION_RECODE_(EDITED) BEYOND HS,COLLEGE(NO 4YR DEG),ASSOC DEG
AGE_AS_OF_12/31/13_(EDITED/IMPUTED) 10
SEX MALE
RACE/ETHNICITY_(EDITED/IMPUTED) NON-HISPANIC BLACK ONLY
MARITAL_STATUS-12/31/13_(EDITED/IMPUTED) UNDER 16 - INAPPLICABLE
FAMILY'S_TOTAL_INCOME 73568
ADULT_BODY_MASS_INDEX_(>17)_-_RD_5/3 inapplicable
HIGH_BLOOD_PRESSURE_DIAG_(>17) Inapplicable
CORONARY_HRT_DISEASE_DIAG_(>17) Inapplicable
AGE_OF_DIAGNOSIS-HIGH_BLOOD_PRESSURE inapplicable
MULT_DIAG_HIGH_BLOOD_PRESS_(>17) Inapplicable
AGE_OF_DIAGNOSIS-CORONARY_HEART_DISEASE inapplicable
ANGINA_DIAGNOSIS_(>17) Inapplicable
AGE_OF_DIAGNOSIS-ANGINA inapplicable
HEART_ATTACK_(MI)_DIAG_(>17) Inapplicable
AGE_OF_DIAGNOSIS-HEART_ATTACK(MI) inapplicable
OTHER_HEART_DISEASE_DIAG_(>17) Inapplicable
AGE_OF_DIAGNOSIS-OTHER_HEART_DISEASE inapplicable
STROKE_DIAGNOSIS_(>17) Inapplicable
AGE_OF_DIAGNOSIS-STROKE inapplicable
EMPHYSEMA_DIAGNOSIS_(>17) Inapplicable
AGE_OF_DIAGNOSIS-EMPHYSEMA inapplicable
CHRONC_BRONCHITS_LAST_12_MTHS_(>17)-R3/1 Inapplicable
CHRONC_BRONCHITS_LAST_12_MTHS_(>17)-R5/3 Inapplicable
HIGH_CHOLESTEROL_DIAGNOSIS_(>17) Inapplicable
AGE_OF_DIAGNOSIS-HIGH_CHOLESTEROL inapplicable
CANCER_DIAGNOSIS_(>17) Inapplicable
CANCER_DIAGNOSED_-_BLADDER_(>17) Inapplicable
CANCER_DIAGNOSED_-_BREAST_(>17) Inapplicable
CANCER_DIAGNOSED_-_CERVIX_(>17) Inapplicable
CANCER_DIAGNOSED_-_COLON_(>17) Inapplicable
CANCER_DIAGNOSED_-_LUNG_(>17) Inapplicable
CANCER_DIAGNOSED_-_LYMPHOMA_(>17) Inapplicable
CANCER_DIAGNOSED_-_MELANOMA_(>17) Inapplicable
CANCER_DIAGNOSED_-_OTHER_(>17) Inapplicable
CANCER_DIAGNOSED_-_PROSTATE_(>17) Inapplicable
CANCER_DIAGNOSED_-_SKIN-NONMELANO_(>17) Inapplicable
CANCER_DIAGNOSED-SKIN-UNKNOWN_TYPE_(>17) Inapplicable
CANCER_DIAGNOSED_-_UTERUS_(>17) Inapplicable
DIABETES_DIAGNOSIS_(>17) Inapplicable
AGE_OF_DIAGNOSIS-DIABETES inapplicable
JOINT_PAIN_LAST_12_MONTHS_(>17)_-_RD_3/1 Inapplicable
JOINT_PAIN_LAST_12_MONTHS_(>17)_-_RD_5/3 Inapplicable
ARTHRITIS_DIAGNOSIS_(>17) Inapplicable
AGE_OF_DIAGNOSIS-ARTHRITIS inapplicable
ASTHMA_DIAGNOSIS Yes
AGE_OF_DIAGNOSIS-ASTHMA 6
DOES_PERSON_STILL_HAVE_ASTHMA-RD3/1 Yes
DOES_PERSON_STILL_HAVE_ASTHMA_-_RD_5/3 Yes
ASTHMA_ATTACK_LAST_12_MOS_-_RD3/1 No
USED_ACUTE_PRES_INHALER_LAST_3_MOS-RD5/3 Yes
USED>3ACUTE_CN_PRES_INH_LAST_3_MOS-RD5/3 No
EVER_USED_PREV_DAILY_ASTHMA_MEDS_-RD_5/3 Yes
NOW_TAKE_PREV_DAILY_ASTHMA_MEDS_-_RD_5/3 Yes
HAVE_PEAK_FLOW_METER_AT_HOME_-_RD_5/3 No
EVER_USED_PEAK_FLOW_METER_-_RD_5/3 Inapplicable
ADHDADD_DIAGNOSIS_(5-17) No
AGE_OF_DIAGNOSIS-ADHD/ADD inapplicable
PREGNANT_DURING_REF_PERIOD_-_RD_3/1 Inapplicable
LIMITATION_IN_PHYSICAL_FUNCTIONING-RD3/1 Don't know
EMPLOYMENT_STATUS_RD_3/1 INAPPLICABLE
HAS_MORE_THAN_ONE_JOB_RD_3/1_INT_DATE Inapplicable
SELF-EMPLOYED_AT_RD_3/1_CMJ Inapplicable
CHOICE_OF_HEALTH_PLANS_AT_RD_3/1_CMJ Inapplicable
INDUSTRY_GROUP_RD_3/1_CMJ INAPPLICABLE
UNION_STATUS_AT_RD_3/1_CMJ Inapplicable
OCCUPATION_GROUP_RD_3/1_CMJ INAPPLICABLE
HEALTH_INSUR_HELD_FROM_RD_3/1_CMJ_(ED) Inapplicable
HEALTH_INSUR_OFFERED_BY_RD_3/1_CMJ_(ED) Inapplicable
EMPLOYER_OFFERS_HEALTH_INS_RD_3/1_CMJ Inapplicable
FULL_YEAR_INSURANCE_COVERAGE_STATUS_2013 <65 ANY PRIVATE
ANYONE_IN_RU_HAVE_FSA_-_R3/1 No
#_WKS/MON_WOUT_HLTH_INS_PRV_YR-PN_18_ONL inapplicable
PRIVATE_INSURANCE_ANY_TIME_IN_R5/R3 Yes
PUBLIC_INS_ANY_TIME_IN_R5/R3 Yes
INSURED_ANY_TIME_IN_R3/1 Yes
ANY_TIME_COVERAGE_BY_STATE_INS_-_R3/1 No
#_OFFICE-BASED_PROVIDER_VISITS_13 51
#_OUTPATIENT_DEPT_PROVIDER_VISITS_13 11
#_EMERGENCY_ROOM_VISITS_13 3
#_NIGHTS_IN_HOSP_FOR_DISCHARGES_2013 26
FINAL_PERSON_WEIGHT_2013 16134.188642
TOTAL_OFFICE-BASED_EXP_13 137891
TOTAL_OUTPATIENT_PROVIDER_EXP_13 3444
TOT_HOSP_IP_FACILITY_+_DR_EXP_13 59615
TOTAL_ER_FACILITY_+_DR_EXP_13 1163
CENSUS_REGION_AS_OF_12/31/13 South
EDUCATION_RECODE_(EDITED) LESS THAN/EQUAL TO 8TH GRADE
AGE_AS_OF_12/31/13_(EDITED/IMPUTED) 68
SEX FEMALE
RACE/ETHNICITY_(EDITED/IMPUTED) NON-HISPANIC OTHER RACE OR MULTIPLE RACE
MARITAL_STATUS-12/31/13_(EDITED/IMPUTED) DIVORCED
FAMILY'S_TOTAL_INCOME 31349
ADULT_BODY_MASS_INDEX_(>17)_-_RD_5/3 27.8
HIGH_BLOOD_PRESSURE_DIAG_(>17) Yes
CORONARY_HRT_DISEASE_DIAG_(>17) No
AGE_OF_DIAGNOSIS-HIGH_BLOOD_PRESSURE 52
MULT_DIAG_HIGH_BLOOD_PRESS_(>17) Yes
AGE_OF_DIAGNOSIS-CORONARY_HEART_DISEASE inapplicable
ANGINA_DIAGNOSIS_(>17) No
AGE_OF_DIAGNOSIS-ANGINA inapplicable
HEART_ATTACK_(MI)_DIAG_(>17) No
AGE_OF_DIAGNOSIS-HEART_ATTACK(MI) inapplicable
OTHER_HEART_DISEASE_DIAG_(>17) Yes
AGE_OF_DIAGNOSIS-OTHER_HEART_DISEASE inapplicable
STROKE_DIAGNOSIS_(>17) No
AGE_OF_DIAGNOSIS-STROKE inapplicable
EMPHYSEMA_DIAGNOSIS_(>17) No
AGE_OF_DIAGNOSIS-EMPHYSEMA inapplicable
CHRONC_BRONCHITS_LAST_12_MTHS_(>17)-R3/1 No
CHRONC_BRONCHITS_LAST_12_MTHS_(>17)-R5/3 No
HIGH_CHOLESTEROL_DIAGNOSIS_(>17) Yes
AGE_OF_DIAGNOSIS-HIGH_CHOLESTEROL 59
CANCER_DIAGNOSIS_(>17) No
CANCER_DIAGNOSED_-_BLADDER_(>17) Inapplicable
CANCER_DIAGNOSED_-_BREAST_(>17) Inapplicable
CANCER_DIAGNOSED_-_CERVIX_(>17) Inapplicable
CANCER_DIAGNOSED_-_COLON_(>17) Inapplicable
CANCER_DIAGNOSED_-_LUNG_(>17) Inapplicable
CANCER_DIAGNOSED_-_LYMPHOMA_(>17) Inapplicable
CANCER_DIAGNOSED_-_MELANOMA_(>17) Inapplicable
CANCER_DIAGNOSED_-_OTHER_(>17) Inapplicable
CANCER_DIAGNOSED_-_PROSTATE_(>17) Inapplicable
CANCER_DIAGNOSED_-_SKIN-NONMELANO_(>17) Inapplicable
CANCER_DIAGNOSED-SKIN-UNKNOWN_TYPE_(>17) Inapplicable
CANCER_DIAGNOSED_-_UTERUS_(>17) Inapplicable
DIABETES_DIAGNOSIS_(>17) Yes
AGE_OF_DIAGNOSIS-DIABETES 52
JOINT_PAIN_LAST_12_MONTHS_(>17)_-_RD_3/1 Yes
JOINT_PAIN_LAST_12_MONTHS_(>17)_-_RD_5/3 Yes
ARTHRITIS_DIAGNOSIS_(>17) Yes
AGE_OF_DIAGNOSIS-ARTHRITIS 52
ASTHMA_DIAGNOSIS No
AGE_OF_DIAGNOSIS-ASTHMA inapplicable
DOES_PERSON_STILL_HAVE_ASTHMA-RD3/1 Inapplicable
DOES_PERSON_STILL_HAVE_ASTHMA_-_RD_5/3 Inapplicable
ASTHMA_ATTACK_LAST_12_MOS_-_RD3/1 Inapplicable
USED_ACUTE_PRES_INHALER_LAST_3_MOS-RD5/3 Inapplicable
USED>3ACUTE_CN_PRES_INH_LAST_3_MOS-RD5/3 Inapplicable
EVER_USED_PREV_DAILY_ASTHMA_MEDS_-RD_5/3 Inapplicable
NOW_TAKE_PREV_DAILY_ASTHMA_MEDS_-_RD_5/3 Inapplicable
HAVE_PEAK_FLOW_METER_AT_HOME_-_RD_5/3 Inapplicable
EVER_USED_PEAK_FLOW_METER_-_RD_5/3 Inapplicable
ADHDADD_DIAGNOSIS_(5-17) Inapplicable
AGE_OF_DIAGNOSIS-ADHD/ADD inapplicable
PREGNANT_DURING_REF_PERIOD_-_RD_3/1 Inapplicable
LIMITATION_IN_PHYSICAL_FUNCTIONING-RD3/1 Yes
EMPLOYMENT_STATUS_RD_3/1 NOT EMPLOYED DURING RD 3/1
HAS_MORE_THAN_ONE_JOB_RD_3/1_INT_DATE Inapplicable
SELF-EMPLOYED_AT_RD_3/1_CMJ Inapplicable
CHOICE_OF_HEALTH_PLANS_AT_RD_3/1_CMJ Inapplicable
INDUSTRY_GROUP_RD_3/1_CMJ INAPPLICABLE
UNION_STATUS_AT_RD_3/1_CMJ Inapplicable
OCCUPATION_GROUP_RD_3/1_CMJ INAPPLICABLE
HEALTH_INSUR_HELD_FROM_RD_3/1_CMJ_(ED) Inapplicable
HEALTH_INSUR_OFFERED_BY_RD_3/1_CMJ_(ED) Inapplicable
EMPLOYER_OFFERS_HEALTH_INS_RD_3/1_CMJ Inapplicable
FULL_YEAR_INSURANCE_COVERAGE_STATUS_2013 65+ EDITED MEDICARE AND PRIVATE
ANYONE_IN_RU_HAVE_FSA_-_R3/1 No
#_WKS/MON_WOUT_HLTH_INS_PRV_YR-PN_18_ONL inapplicable
PRIVATE_INSURANCE_ANY_TIME_IN_R5/R3 No
PUBLIC_INS_ANY_TIME_IN_R5/R3 Yes
INSURED_ANY_TIME_IN_R3/1 Yes
ANY_TIME_COVERAGE_BY_STATE_INS_-_R3/1 No
#_OFFICE-BASED_PROVIDER_VISITS_13 128
#_OUTPATIENT_DEPT_PROVIDER_VISITS_13 9
#_EMERGENCY_ROOM_VISITS_13 3
#_NIGHTS_IN_HOSP_FOR_DISCHARGES_2013 7
FINAL_PERSON_WEIGHT_2013 15650.765936
TOTAL_OFFICE-BASED_EXP_13 183384
TOTAL_OUTPATIENT_PROVIDER_EXP_13 484
TOT_HOSP_IP_FACILITY_+_DR_EXP_13 10670
TOTAL_ER_FACILITY_+_DR_EXP_13 1600
CENSUS_REGION_AS_OF_12/31/13 Northeast
EDUCATION_RECODE_(EDITED) BEYOND HS,COLLEGE(NO 4YR DEG),ASSOC DEG
It looks like the outliers consist of respondents who are chronically unwell, and this may be driving heavy treatment regimens.
Given the evidence, I'm inclined to leave these respondents in the dataset for now. While they are a small proportion of the population, their cases may be indicative of the financial consequences of unmanaged illness. I'll have to make sure their inclusion is explicitly mentioned in my model writeup, since I am making a call on whether these points are corrupt data.
Next, let's look at the overall distribution of expenditure on office based services.
The graph above is a histogram of billed charges in the dataset. This graph is difficult to read because of the scale. I might try and break it in to a few pieces: one consisting of data points near zero (perhaps one with zero and one with strictly positive values), another consisting of data points that are positive but not too large (say less than 50k or so) and the final piece consisting of the extremely large positive values.
Let's try this:
I want to point out here that the dataset consists of weighted points; each point carries a weight (the column is "FINAL_PERSON_WEIGHT_2013") such that when the dataset is scaled out pointwise by these weights, it gives a representative sampling of the U.S. population. Since I do have these weights given to me, I've used these in my histograms to get a more accurate sense of the distribution of these features in the U.S. population.
This graph clearly shows the extent of the zero clustering. It looks like over 80m people are estimated to have expenses of less than $\$10$ - this is about one-fourth of the population.
Next, let's take a look at the same range, but without the mass at 0.
This shows that the modal nonzero expense for office services is about $\$100-\$200$ dollars, with about 32 million people falling into that bucket.
Let's look further out.
There are 240259742 people in the range 0 to 1000
There are 49350962 people in the range 1000 to 3000
There are 8539737 people in the range 5000 to 10000
I've used a bin width of $\$100$ again, to keep things consistent. It looks like about 240m people spend less than $\$1000$ a year for office based services. Significant numbers of people (about 50m people) spend between $\$1000$ and $\$3000$ on office based services. After that, each $\$100$ bin contains relatively few people, but these are at fairly high levels of expenditure for an average person or family.
There is quite a bit of mass in the range $\$5000 - \$10,000$. These people, at 8.5 million people, or about $2.7\%$ of the population, are paying a lot for office based healthcare services. It would be interesting to know how much of expenditure at this level is discretionary or non-essential, and how much of it is for relatively routine healthcare - perhaps for badly managed chronic illnesses, etc. In terms of informing policy, there's nothing wrong with the former, but since chronic illness tends to affect a broad section of people - not all of whom can afford to pay upto $\$10,000$ a year for healthcare services, there may be something to worry about if a lot of this high expenditure is for non-discretionary services.
Let's look a bit further:
There are 4452543 people in the range 10000 to 50000
There is significant mass at these higher levels of expenditure - it's a small proportion of the population (4.5 million, or about $1.4\%$ of the population), but the severity of the costs is non-trivial. It will be very interesting to see if we can find out more about what characterizes people who tend to pay this much for healthcare in a year. Do they have terminal illnesses? Badly managed chronic illnesses? Severe accidents?
Let's look at the tail end of the distribution:
There are 181340 people in the range 50000 to 250000
There are about 1.8 million people in this range, and the severity of the costs they face is astonishingly high. Again, it will be extremely interesting to know what about these people drives this level of expenditure.
Next, let's take a look at the other three predicted variables, in the following order:
- Outpatient services
- Inpatient services
- Emergency Room services
Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x7efd08e73438>
Clearly, there's a lot of zero clustering going on here as well. Let's break it out into pieces like we did for office based services, and see what we find.
There are 310041464 people in the range 0 to 1000
It looks like the vast majority of respondents pay less than $\$10$ for outpatient services.
Let's see what happens if we drop the zero payments:
There are 23065896 people in the range 1 to 1000
There are 300065999 people in the range 0 to 200
Of the order of 300m people incur charges of a couple of hundred dollars for outpatient services in the year.
Let's zoom in a bit more to that area and see if we can figure out what the modal charge is.
The modal charge seems to be $\$30-\$40$ dollars for outpatient services.
Let's now look further to the right to see what the high expenditure cases look like:
There are 5603240 people in the range 1000 to 10000
There are 1168142 people in the range 3000 to 10000
There are about 5.6m people in this range, which is about $1.75\%$ of the population. There are about 1.1m people, or $0.34\%$ of respondents that are paying over $\$3,000$ in this range.
Let's take a look at the rest of the data.
There are 77278 people in the range 10000 to 50000
Not many people here - about 77,000. The maximum amount paid by any respondent is $\$26,703$, so we won't find mass any further to the right.
It looks like outpatient expenditure follows a similar pattern to office based expenditure, although the severity is less extreme.
Let's turn to inpatient expenditure.
Out[61]:
count 36940.000000
mean 977.643097
std 7413.503502
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 320929.000000
Name: TOT_HOSP_IP_FACILITY_+_DR_EXP_13, dtype: float64
Alright, so we have a max of about $\$320,000$. That's extremely high!
Let's go through the same steps as before:
Again, most people pay less than $\$10$.
What if we drop the mass at 0?
Let's move out a bit further:
Ok, quite a lot of mass in this range. Charges in the thousands of dollars seem more likely in the context of inpatient services, since this usually includes procedures like surgeries.
Let's look out past $\$10,000$.
There are 3910489 people in the range 20000 to 50000
This is consistent with the kinds of amounts we're used to hearing about being charged for surgeries, etc. About 4m people seem to have inpatient services costing over $\$20,000$; these amounts are quite severely high. Let's see if it gets worse for anyone.
There are 1381075 people in the range 50000 to 350000
For the 1.4m people in this range, the amounts charged are more than the median household's annual income in the United States. Anyone that falls into this range - and it looks like many people do - will be at great financial risk.
I'd really like to know if the people who end up in this range have anything in common, or whether it's mostly random. Are these life-threatening accidents? Terminal illness? Badly managed chronic illness? Being able to tell the difference, if there is any, could be very informative for healthcare policy, and for advising healthcare consumers about their financial risk.
Let's move on to the final predicted variable, ER expenditure.
Out[70]:
count 36940.000000
mean 176.546833
std 1029.557499
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 63179.000000
Name: TOTAL_ER_FACILITY_+_DR_EXP_13, dtype: float64
The maximum charge for ER services is $\$63,179$. This is a lot lower than the maximum for inpatient services, but still very high.
Let's look at this close up.
Fewer people at $\$0$ than for outpatient or inpatient expenditure, but more than for office based expenditure.
Let's drop the $\$0$ charges and see what it looks like.
There are more people in this range than there were for inpatient services, though less than for office based services and outpatient services. A side-by-side comparison:
While the overall utilization of ER services is lower than for office based services at each cost level, there seem to be many more people using ER services at the higher end of this range than either outpatient or inpatient services.
It's hard to say whether this means anything at this point, but it is interesting nonetheless.
Medical emergencies often come with significant sticker shock, so I'm curious to see what this looks like at much higher levels of expenditure.
Let's go a bit further...
Over 4000 people are estimated to see bills in excess of $\$60,000$ in a year.
I'd like to compare these ranges across categories for a moment.
Again, ER services are the second most frequently used at each cost level in the $\$1000 - \$10,000$ range.
Let's look at the $\$10,000 - \$50,000$ range.
In the $\$10,000 - \$50,000$ range, ER services give way to inpatient services to become the third most frequently used at each cost level. (This is an approximate statement, I'm not checking each bin.)
Finally, let's look at the tail end:
It looks much the same in this range.
My main takeaway from doing this is that the expenditure for these four location based services follows a reasonable pattern for most people. That is, the majority of people will see charges for healthcare expenses that are reasonably insurable if on a plan with a low deductible, but that may bite for the average household on a plan with a relatively high deductible.
One important issue these plots raise is the following: there is a minority of respondents that incur charges for healthcare that are so high as to be clearly financially ruinous for the median household. Questions to ask include:
- Do the respondents who incur these extremely high charges have the financial means to bear them? Or are they likely to go bankrupt?
- What characterizes the respondents with extremely high healthcare costs? Do certain illnesses come into play? Do these charges come at random? Are some people at greater risk than others? Would it help anyone if they were advised that they were at elevated financial risk because of some characteristic they have? Should they have a particular health insurance plan? Can I use any of these insights to build a better planning tool?
These are important questions, and it would be fantastic if I could use my Insight project to shed some light on them, whether as part of the modeling process, or (even better) by integrating them into my webapp.
In the next post, I'll take a look at the explanatory variables.
Click to read and post comments
Dec 21, 2015
The next step in giving my project legs was to actually get data into my analysis stack. For this project, I planned on using a pandas/scikit-learn/statsmodels stack.
To this end, I needed to get a hold of the code dictionary available on this webpage, and use it to rewrite the 2013 MEPS dataset as a comma separated values file.
The first thing to do is to instantiate a scraper using the BeautifulSoup and requests libraries.
I may run this analysis on several MEPS datasets in the future, so I'm going to hardcode a number of pages into a dictionary for easy access later. Here's a function that accesses the dictionary using project names.
So for instance, calling it on "2013" yields:
Out[3]:
'http://meps.ahrq.gov/mepsweb/data_stats/download_data_files_codebook.jsp?PUFId=H163'
Let's write a function that returns the parse tree for the html we need:
And let's check that this worked for our page:
Out[6]:
<title>Medical Expenditure Panel Survey PUF Codebook</title>
I fiddled around with this object for a while and discovered that the data I needed all lay inside bits tagged "font".
Out[7]:
[<font color="#660000">::</font>,
<font color="#660000">::</font>,
<font color="#660000">::</font>,
<font color="#660000">::</font>,
<font color="#660000">::</font>,
<font color="#660000">::</font>,
<font color="#660000">::</font>,
<font color="#660000">::</font>,
<font color="#660000">::</font>,
<font color="#660000">::</font>,
<font color="#660000">::</font>,
<font color="#660000">::</font>,
<font color="#660000">::</font>,
<font color="#660000">::</font>,
<font color="#660000">::</font>,
<font color="#660000">::</font>,
<font color="#660000">::</font>,
<font color="#660000">::</font>,
<font class="smallBlack" face="Arial">MEPS HC-163</font>,
<font class="smallBlack" face="Arial">2013 FULL YEAR CONSOLIDATED DATA CODEBOOK</font>,
<font class="smallBlack" face="Arial">DATE: October 7, 2015</font>,
<font class="smallBlack" face="Arial">940 </font>,
<font class="smallBlack" face="Arial">941 </font>,
<font class="smallBlack" face="Arial">PERS ELIGIBLE FOR ACCESS SUPPLEMENT-R4/2</font>,
<font class="smallBlack" face="Arial">241 </font>,
<font class="smallBlack" face="Arial">242 </font>,
<font class="smallBlack" face="Arial">MILITARY FULL-TIME ACTIVE DUTY - R3/1</font>,
<font class="smallBlack" face="Arial">243 </font>,
<font class="smallBlack" face="Arial">244 </font>,
<font class="smallBlack" face="Arial">MILITARY FULL-TIME ACTIVE DUTY - R4/2</font>,
<font class="smallBlack" face="Arial">245 </font>,
<font class="smallBlack" face="Arial">246 </font>,
<font class="smallBlack" face="Arial">MILITARY FULL-TIME ACTIVE DUTY - R5/3</font>,
<font class="smallBlack" face="Arial">473 </font>,
<font class="smallBlack" face="Arial">474 </font>,
<font class="smallBlack" face="Arial">ANY LIMITATION WORK/HOUSEWRK/SCHL-RD 3/1</font>,
<font class="smallBlack" face="Arial">475 </font>,
<font class="smallBlack" face="Arial">476 </font>,
<font class="smallBlack" face="Arial">ANY LIMITATION WORK/HOUSEWRK/SCHL-RD 5/3</font>,
<font class="smallBlack" face="Arial">734 </font>,
<font class="smallBlack" face="Arial">735 </font>,
<font class="smallBlack" face="Arial">SAQ 12 MOS: # VISITS TO MED OFF FOR CARE</font>,
<font class="smallBlack" face="Arial">786 </font>,
<font class="smallBlack" face="Arial">787 </font>,
<font class="smallBlack" face="Arial">SAQ 4WKS: FELT CALM/PEACEFUL SF-12V2</font>,
<font class="smallBlack" face="Arial">774 </font>,
<font class="smallBlack" face="Arial">775 </font>,
<font class="smallBlack" face="Arial">SAQ: HLTH LIMITS CLIMBING STAIRS SF-12V2</font>,
<font class="smallBlack" face="Arial">834 </font>,
<font class="smallBlack" face="Arial">835 </font>,
<font class="smallBlack" face="Arial">SAQ: DATE COMPLETED - MONTH</font>,
<font class="smallBlack" face="Arial">836 </font>,
<font class="smallBlack" face="Arial">839 </font>,
<font class="smallBlack" face="Arial">SAQ: DATE COMPLETED - YEAR</font>,
<font class="smallBlack" face="Arial">772 </font>,
<font class="smallBlack" face="Arial">773 </font>,
<font class="smallBlack" face="Arial">SAQ: HLTH LIMITS MOD ACTIVITIES SF-12V2</font>,
<font class="smallBlack" face="Arial">790 </font>,
<font class="smallBlack" face="Arial">791 </font>,
<font class="smallBlack" face="Arial">SAQ 4WKS: FELT DOWNHEARTED/DEPR SF-12V2</font>]
These tagged pieces contain the variable description, preceded by the start and stop columns for the position of the field in the fixed-width file.
For example:
<font class="smallBlack" face="Arial">353 </font>,
<font class="smallBlack" face="Arial">354 </font>,
<font class="smallBlack" face="Arial">CANCER DIAGNOSED - MELANOMA (>17)</font>,
The next function extracts all the tags in the parse tree with attribute "font" and collects them in a list, that I (in very poor taste) called final. In my defense, the list is really nothing, and nothing is ever final, so there you go.
I then use a helper function I called "prune_list" to get rid of the crud I didn't want to deal with that also happened to be part of a "font" tag. Finally, this function writes the code dictionary table to the file provided as the second argument to the function.
Here's the definition of prune_list for reference:
Finally, here's the function that takes the MEPS dataset, and a dataframe containing the column dictionary, and writes the dataset out into a comma separated values file:
Now that we have the dataset written to disk in a format that I can use, in the next post I'll take a look at what the dataset contains and try to get a sense of the questions I might be able to answer with it.
Click to read and post comments
In my last post, I looked at several datasets that CMS provides, and worried that none of them were adequate to answer the question at hand: what drives healthcare spending in the United States?
What then jumpstarted my Insight project in its first week was a tip from one of the Insight NYC program directors, Friederike Schueuer, who suggested I look into the Medical Expenditure Panel Survey dataset.
This turned out to be something of a jackpot. The MEPS panel dataset has been assembled (under different names) since 1996, and surveys about 35,000 people a year. It collects detailed information on (amongst other things) their demographics, household composition, employment, income, socio-economic status, insurance status, health status (including chronic conditions) and how much they were charged for healthcare over the year in four different provider settings: office-based, outpatient, inpatient, and emergency room.
What turned out to be harder about working with this dataset was the fact that the data was only provided in a flat-file format with no delimiters. It turns out that the term for such files is "fixed-width", where each record appears in its own row (terminated by the carriage return character) and the entries for each column appear within fixed columns in the dataset. For example, the record ID for each record might occupy exactly the first five characters of the record, the insurance status might be recorded using characters in columns 101-105, etc. The only problem with this was that the only place I seemed to be able to find the correspondence between variables and column numbers was in an html file on the web entirely separate from the dataset itself. For example, this is the codebook for the 2013 dataset, and there were over 1700 variables in the dataset.
In the next post, I'll detail how I went about getting a hold of the code dictionary by using a web scraper, and then rewriting the MEPS dataset into a comma separated file so I could load it into an analysis tool like (Python) pandas.
Click to read and post comments
Dec 20, 2015
In this post, I'll outline how I made decisions about the data I chose to use to build the model behind my Insight project.
There were a few choices to be made, and a lot of constraints to consider. Something I came across in looking up data sources for this project was the contrast between the richness of the data currently available when it came to healthcare expenditure in the public (Medicare, Medicaid) sector, and how useful this data was for solving the problem I wanted to address.
To be more specific, there is a wealth of data available from CMS that sheds light on utilization and expenditure patterns for Medicare/Medicaid patients. However, it's hard to say to what extent patterns in this market are reflected in the private payor market - for patients with private individual plans or employer sponsored insurance. It's reasonable to be concerned that these two markets are sufficiently different in terms of market dynamics that what drives costs in the public payor market won't translate to the private payor market.
One dataset that caught my attention despite this issue was the Medicare Provider Utilization and Payment Data set. This is a highly detailed dataset reporting information that includes: NPI (National Provider ID), provider location information (physical address), provider specialty, provider Medicare participation, HCPCS (Healthcare Common Procedure Coding System) code and description, average and standard deviation of reimbursement under Medicare (including coinsurance, deductible and any third party payments) and the average and standard deviation of the submitted charge (what the provider actually bills). Medicare reimbursement rates are commonly held to track operating costs for providers, and so the ratio of submitted charges to Medicare reimbursement for each procedure code and in each location can be used as a rough proxy for the markups over cost seen at these facilities.
Amongst other things, the markup proxy may be useful as a way to estimate billed charges for consumers, if the Medicare reimbursement rates for a specific condition are already known.
More interestingly, (spoiler alert!) one of the conclusions I drew at the end of my Insight project was that I expect the relative market power of the provider and the payor to be a strong determinant of what a healthcare consumer is billed. (It turned out that demographic, socioeconomic and health status information were useful, but nowhere near sufficient to explain the statistical variation in billed charges.) Given the results of my project, I've come to believe that considering market power needs to be a crucial part of the discussion on managing healthcare costs and improving healthcare outcomes.
This perspective is supported by a recent paper by Cooper, Craig, Gaynor and van Reenen from earlier this month. They show that the primary determinant for the variation in healthcare spending for privately insured consumers is providers' transaction prices; further, hospital prices are positively associated with indicators of hospital market power. They show that even after controlling for a multitude of demand and cost factors, prices at hospitals enjoying monopoly power in a hospital referral region are 15.3% higher than in markets with four or more hospitals.
Another exciting feature of this paper is their use of a novel dataset provided by the Healthcare Cost Institute (HCCI) containing claims data submitted by Aetna, Humana and United Healthcare. Why is this exciting?
The Kaiser Family Foundation has compiled data estimating that in 2014, Medicare covered 13% of Americans, while Medicaid covered 19% of Americans. Combined, this accounts for about a third of the U.S. population, which contrasts with 49% of Americans covered under an employer sponsored plan. While Medicare and Medicaid are likely to be the most powerful payors in the country (in terms of pricing power), private ESIs are likely to have much less pricing power relative to providers, while covering many more people. Therefore, one can imagine that understanding the private payor market will be crucial to understanding what drives healthcare expenditure in the U.S.
The Cooper-Craig-Gaynor-van Reenen paper has also been featured in the news, which may make for better reading if you're not in the mood for an academic paper.
For yet another perspective on the importance of market power in controlling healthcare costs, see Nicholas Bagley's recent post drawing attention to the obstructions strewn in the path of those seeking more and better data on the private payor market.
Given all this, I'm somewhat interested to see whether the markup proxies that can be obtained from the aforementioned Medicare Provider Utilization and Payment Data dataset can be sensibly used (perhaps after joining with another database with a more comprehensive list of providers in various referral regions) to estimate providers' pricing power.
It was unfortunately impossible to run this experiment within the short timeframe of an Insight project, but it's definitely something I'd like to investigate further in the coming months.
Well, I'm currently at a loss as to how to proceed. In the next post, I'll detail the dataset that I did discover to be useful, and what was helpful about it.
Click to read and post comments
In the next set of posts, I'm going to introduce my Insight project.
Prompted by Jake Klamka's suggestion that we try to work on something that we feel would be impactful, I gave some thought during the first week of Insight to the question of the healthcare market in the United States.
This was far from being the first time that I'd had to think about this question. As an undergraduate, I remember reading a comparative analysis of healthcare expenditure in OECD countries, in which the point was made that the U.S. led the OECD in healthcare expenditure as a percentage of GDP, but was far from leading OECD countries in terms of allocation efficiency.
Since the ACA was passed, I've been quite interested to read economic policy papers and blogs covering the healthcare market. One that I've found quite interesting is The Incidental Economist. (This is not an endorsement of TIE bloggers' conclusions, views, or anything else.)
The primary market failure I've been interested in for a while is the lack of pricing transparency in healthcare. Anyone who's tried to get an estimate or quote from a hospital for a visit, test or procedure before said visit, test or procedure will be acutely aware of how frustrating this exercise is. Many people have found that asking for this information almost always nets you at best an evasive answer along the lines of ''it depends on a million things, just come by and we'll send you a bill later'', or at worst an outright refusal to answer the question. The uncertainty introduced by the combination of this behavior, and the tremendous and variable markups on healthcare services make for poor allocative efficiency, I think. In plain English, it's not good business.
Questions I wanted to answer, therefore, included: what drives prices in the healthcare market? Is there a way to predict ahead of time what a person can expect to be billed for healthcare services over a year? What explains the variation in billed charges for healthcare between different people and different providers? Is there a way to provide this information to consumers to enable them to make better choices when it comes to their provider, treatment or lifestyle?
My initial goal for my Insight project was a bit ambitious: I wanted to build an application for consumers to explore what their billed charges for healthcare would be over the next year. A natural application of these insights would be to help consumers compare health insurance plans.
Over the next few posts, I'll outline how I went about doing this.
Click to read and post comments
In September of this year, I took a leave of absence to do the Insight Data Science program in New York City. Insight is a postdoctoral program that helps Fellows to transition from academia to doing data science in industry.
I can't say enough good things about the Insight program, but let me at least offer a few impressions.
The minimum requirements to enter include a Ph.D., and the program tends to accept applicants who are already well prepared to be data scientists (perhaps 90-95% of the way there), and polishes them for the last mile of the transition. The program is fairly selective; we were told that for the September session, Insight received over 900 applicants, and accepted about 70 people. While this is certainly not to say that people who don't make it into Insight aren't awesome, the selectiveness does ensure a talented cohort, and my fellow Fellows (this term being a running pun throughout the program) were one of the best things about the program.
Once in the program, Fellows take a week or so to decide on what they'll do for their Insight project, which is a data science project that they'll showcase to mentoring companies towards the end of the program. We're encouraged to try working on something that is do-able in a 2-3 week timeframe, but that is also impactful in some way. There is also the option to do one of a handful of consulting projects, which consist of well defined projects that real, existing startup companies need to get done, but don't have the manpower or funding to implement.
We then spend the next 2-3 weeks implementing our chosen projects in 1 week iterations (affectionately called MVP's - as used in agile development and popularized by Eric Ries). Not all of our time is available for developing our projects; during this period we are visited by a lot of mentor companies that are interested in telling us about themselves, and what they do with data scientists. I think I slept an average of 4 hours a night during these weeks, and I definitely pulled a couple of all-nighters.
In the final phase of the program, we pivot to demo-ing at mentor companies, as well as prepping for the interview stack that most companies have for data scientists. We've been informed that that the bar for admission to be a practicing data scientist in industry has risen recently, and that the interview loops have become more rigorous and demanding at a lot of companies. This makes a lot of sense: with competitor programs setting up shop, there's been an increase in the supply of candidates like us into the data scientist market. The rapid creation of other bootcamps has also increased the supply of data scientists with different, including non-Ph.D., backgrounds. Finally, the emergence of many university-based data science and analytics masters programs (here's one example close to home that I've helped to build) promises to channel yet more talent into the marketplace.
Challenge accepted.
Click to read and post comments
I've set up this blog in order to encourage me to write on research topics that I'm interested in, and on projects that I'm currently doing.
I've found in the past that teaching has been a great way of learning new things well. This blog is a way to force me to reason through and explain what I'm working on in ways that are clear, detailed and that relate to concrete project goals.
Click to read and post comments
Nov 29, 2015
This is my first post!
def assigns(m,c):
def theline():
return (lambda x: m*x+c)
return theline
This function takes parameters \(m\) and \(c\) and returns a constructor for a linear function \(y=m*x+c\).
Let's try this out:
newline_generator=assigns(1,1)
theline=newline_generator()
That's all she wrote.
Click to read and post comments